ABSTRACT: Ordered sequences of univariate or multivariate regressions provide sta-
tistical models for analysing data from randomized, possibly sequential interventions,
from cohort or multi-wave panel studies, but also from cross-sectional or retrospective
studies. Conditional independences are captured by what we name regression graphs,
provided the generated distribution shares some properties with a joint Gaussian dis-
tribution. Regression graphs extend purely directed, acyclic graphs by two types of
undirected graph, one type for components of joint responses and the other for com-
ponents of the context vector variable. We review the special features and the history
of regression graphs, prove criteria for Markov equivalence and discuss the notion of a
simpler statistical covering model. Knowledge of Markov equivalence provides alterna-
tive interpretations of a given sequence of regressions, is essential for machine learning
strategies and permits the use of the simple graphical criteria of regression graphs on graphs
for which the corresponding criteria are in general more complex. Under the known con-
ditions that a Markov equivalent directed acyclic graph exists for any given regression
graph, we give a polynomial time algorithm to find one such graph.

Key words: Chain graphs, Concentration graphs, Covariance graphs, Graphical Markov
models, Independence graphs, Intervention models, Labeled trees, Lattice conditional
independence models, Structural equation models.

1 Introduction 

A common framework to model, analyse and interpret data for several, partially ordered 
joint or single responses is a sequence of multivariate or univariate regressions where 
the responses may be continuous or discrete or of both types. Each response is to be 
generated by a set of its regressors, called its directly explanatory variables. Based 
on prior knowledge or on statistical analysis, one is to decide which of the variables in 
a set of potentially explanatory ones are needed for the generating process. Thus, for 
each response, a first ordering determines what is potentially explanatory, named the
past of the response, and what can never be directly explanatory, named the future. 
Furthermore, no variable is taken to be explanatory for itself. 

Corresponding regression graphs consist of nodes and of edges coupling dis- 
tinct nodes. The nodes represent the variables and the edges stand for condi- 
tional dependences, directed or undirected. The directly explanatory variables for an 
individual response variable Yi show in the graph as the set of nodes from which arrows 
start and point to node i. These nodes are commonly named the parents of node i. 
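
As a minimal sketch of this reading of the graph, arrows can be stored as (parent, child) pairs and the parents of a node collected from them. The node labels and the small arrow set below are hypothetical and are not taken from any figure of this paper.

```python
# Hypothetical arrow set: each pair (parent, child) stands for an arrow that
# starts at a directly explanatory variable and points to its response.
ARROWS = {("x", "y"), ("z", "y"), ("w", "x")}


def parents(node, arrows):
    """Return the set of nodes from which an arrow points to `node`."""
    return {parent for parent, child in arrows if child == node}


print(sorted(parents("y", ARROWS)))  # ['x', 'z']
```

Here "w" is only indirectly explanatory for "y", so it does not appear among the parents of "y".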

Every missing edge corresponds to a conditional independence statement. Edges 
are arrows for directed dependences and lines for undirected dependences 
among variables on equal standing, that is among components of joint responses or 
of context variables. Undirected dependences are often also called associations. A given 
regression graph reflects a particular type of study which may be a simple experiment, 
a more complex sequence of interventions or an observational study. 

One of the common features of pure experiments and of sequences of interventions 
with randomized, proportional allocation of individuals to treatments, is that, by study 
design, some variables can be regarded to act just like independent random variables. For 
instance, in an experiment with proportional numbers of individuals assigned randomly 
to each level combination of several experimental conditions, the set of explanatory vari- 
ables contains no edge in the corresponding regression graph, reflecting a situation like 
mutual independence. Similarly, with fully randomized interventions, each treatment 
variable has exclusively arrows starting from its node but no incoming arrow. After 
statistical analysis, some conditional independences may be appropriate additional sim- 
plifications which show as further missing edges. 

Sequences of interventions give a time ordering for some of the variables. A time 
order is also present in cohort or multi-wave panel studies and in retrospective studies 
which focus on investigating effects of variables at one fixed time point in the past, 
without the chance of intervening. By contrast, in a strictly cross-sectional study, in 
which observations for all variables are obtained at the same time, any particular variable 
ordering is only assumed rather than implied by actual time. 

At the planning stage of empirical studies, the node set is ordered into
sequences of single or joint responses, Ya, Yb, Yc, ..., that we call blocks of variables on
equal standing and draw in figures as boxes. This determines for the following
statistical analyses that within each block there are undirected edges and between blocks 
there are directed edges, the arrows. The first block on the left contains the primary 
responses of Ya and the last block on the right contains context variables, also 
named the background variables. After statistical analyses, arrows may start from 
nodes within any block but always end at a node in one of the blocks in the future. 
Thus, there are no arrows pointing to context variables and all arrows point in the 
same direction, from right to left. An intermediate variable is a response to some 
variables and also explanatory for other variables so that it has both incoming and
outgoing arrows in the regression graph. 

As an example, we take data from a retrospective study with 283 adult females
answering questions about their childhood when visiting their general practitioner, mostly
for some minor health problems; see Hardt et al. (2008). A well-fitting graph is shown
in Figure 1. It contains two binary variables, A, B, and six quantitative variables.
Except for the directly recorded feature age in years, all other variables are derived from
answers to questionnaires, coded so that high values correspond to high scores. 

The three blocks a, b, c reflect here a time-ordering of vector variables, Ya, Yb, Yc,
with Ya representing the joint response of primary interest, Yb an intermediate vector
variable and Yc a context vector variable. The three individual components of the
primary response Ya are different aspects of how the respondent recollects
her relationship to the mother. The intermediate variable Yb has two components that
reflect severe distress during childhood. The three components of the context variable 
Yc capture background information about the respondent and about her family. 

The graph of Figure 1, derived after statistical analyses, shows among other inde-
pendences that Ya is conditionally independent of Yc given Yb, written compactly in
terms of sets of nodes as a ⊥⊥ c | b. None of the components of Yc has an arrow pointing
directly to a component of Ya, but sequences of arrows lead indirectly from c to a via b.




Figure 1: A well-fitting regression graph for data on n = 283 adult females; within boxes
are Ya, Yb, Yc, with the corresponding ordered partitioning of the node set on top of the boxes.



This says, for instance, that prediction of Ya is not improved by knowing the context 
variable Yc if information on the more recent intermediate variable Yb is available. More
interpretations of the independences are given later. When some edges are missing and 
each edge present corresponds to a substantial dependence, the graph may also be viewed 
as a research hypothesis on which variables are needed to generate the joint distribution;
see Wermuth and Lauritzen (1990). The goodness-of-fit of such a hypothesis can be
tested in future studies.

Two models are Markov equivalent whenever their associated graphs capture the 
same independence structure, that is the graphs lead to the same set of implied 
independence statements. Markov equivalent models cannot be distinguished on the
basis of statistical goodness-of-fit tests for any given set of data. This may pose a prob-
lem in machine learning contexts. More precisely, knowledge about Markov equivalent
models is essential for designing search procedures that converge with an increasing sam-
ple size to a true generating graph; see Castelo and Kočka (2003) for searches within
the class of directed acyclic graphs, which consist exclusively of arrows and capture 
independences of ordered sequences in single response regressions. 



Figure 2: Two graphs, a) and b), that are Markov equivalent to the graph of Yb, Yc in Figure 1.



More importantly though, Markov equivalent models may offer alternative interpre- 
tations of a given well-fitting model or open the possibility of using different types of 
fitting algorithms. 

As we shall see in Section 7, the graph for nodes A, R, B, P, Q in blocks b and c of
Figure 1 is Markov equivalent to both graphs of Figure 2. From knowing the Markov
equivalence to the graph in Figure 2a), the joint response model for Yb given Yc may
also be fitted in terms of univariate regressions and from the Markov equivalence to the
graph in Figure 2b), one knows for instance directly, using Proposition 1 below, that
sexual abuse is independent of age and schooling given knowledge about family distress 
and family status. 



Regression graphs are a subclass of the maximal ancestral graphs of Richardson and Spirtes
(2002) and these are a subclass of the summary graphs of Wermuth (2011). The two
types are called corresponding graphs if they result after marginalising over a node 
set m and conditioning on a disjoint node set c from a given directed acyclic graph. Both 
are independence-preserving graphs in the sense that they give the independence 
structure implied by the generating graph for all the remaining nodes and further condi- 
tioning or marginalising can be carried out just as if the possibly much larger generating 
graph were used. The summary graph permits, in addition, to trace possible distortions 
of generating dependences as they arise in conditional dependences among the remaining 
variables, for instance in parameters of the maximal ancestral graph models. 

In the following Section 2, we introduce further concepts and the notation needed to 
state at the end of Section 2, some of the main results of the paper and related results 
in the literature. In Section 3, a well-fitting regression graph is derived for data of 
chronic pain patients. Sections 4, 5 and 6 may be skipped if one wants to turn directly 
to formal definitions, new results and proofs in Section 7. Section 4 reviews linear 
recursion relations that are mimicked by graphs and lead to the standard and to special
ways of combining probability statements, summarized here in Section 5. In Section 6,
some of the previous results in the literature for graphs and for Markov equivalences are
highlighted. The Appendix contains details of the regression analyses in Section 3.



2 Some further concepts and notation 

Figure 3 shows five ordered blocks, to introduce the notion of connected components of
the graph to represent conditionally independent responses given their common past. 



Figure 3: A typical first ordering, here of five vector variables, Ya, ..., Ye, drawn as boxes
from left to right: primary response Ya listed on the left, context variable Ye on the right,
intermediate variables Yb, Yc, Yd in between.




Figure 4: A regression graph for 14 variables corresponding to blocks a to e of Figure 3,
with connected components {g1}, {g2, g3}, {g4, g5}, {g6, g7}, {g8, g9} in blocks a to e.

In the example of a regression graph in Figure 4, corresponding to Figure 3, Ya is a
single response, Yb has two component variables, both of Yc and Yd have four and Ye has
three. Each of the blocks b to e shows two stacked boxes, that is subsets of nodes that
are without any edge joining them. This is to indicate that disconnected components of 
a given response are conditionally independent given their past and that disconnected 
components of the context variables are completely independent. 

Graphs with dashed lines are covariance graphs denoted by Gcov, those with
full lines are concentration graphs denoted by Gcon; see Wermuth and Cox (1998).
The names are to remind one of their parametrisation in regular joint Gaussian dis-
tributions, for which the covariance matrix is invertible and gives the concentration
matrix. A zero ik-element in the covariance matrix means i ⊥⊥ k and a zero ik-element in
the concentration matrix means i ⊥⊥ k | {1, ..., d} \ {i, k}; see Wermuth (1976a) or
Cox and Wermuth (1996), Section 3.4.
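
The difference between the two parametrisations can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical 3 x 3 concentration matrix chosen only for illustration; the helper `inverse` is a plain Gauss-Jordan routine written for this example and is not part of the paper.

```python
from fractions import Fraction


def inverse(matrix):
    """Invert a small square matrix exactly by Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    rows = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
            for i, row in enumerate(matrix)]
    for col in range(n):
        pivot_row = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot_row] = rows[pivot_row], rows[col]
        pivot = rows[col][col]
        rows[col] = [x / pivot for x in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [row[n:] for row in rows]


# Hypothetical concentration matrix for variables 1, 2, 3 with a zero
# (1,3)-element: in a Gaussian distribution, 1 and 3 are independent given 2.
concentration = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
covariance = inverse(concentration)

print(concentration[0][2])  # 0
print(covariance[0][2])     # 1/4
```

The zero in the concentration matrix does not carry over to the covariance matrix: in this example, variables 1 and 3 are conditionally independent given variable 2 but marginally dependent, so the edge for this pair is missing in the concentration graph and present in the covariance graph.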
The regression graph of Figure 4 is consistent with the first ordering in Figure 3 since
no additional ordering is introduced, as it would have been by arrows within blocks a
to e. After statistical analysis, blocks of the first ordering are often subdivided into
the connected components of the graph, gj, shown here in Figure 4 with the help of
the stacked boxes. For several nodes in gj, each pair of nodes (i, k) is connected by at
least one undirected ik-path within gj. An ik-path connects its endpoint nodes i, k via
a sequence of edges coupling distinct other nodes along the path, named the path's
inner nodes.

For a regression graph, the node set N has an ordered partitioning into two
subsets, N = (u, v), distinguishing response nodes within u from context nodes within
v. The connected components gj, for j = 1, ..., J, are the disconnected, undirected
graphs that remain after removing all arrows from the graph. Thus, the displayed,
stacked boxes in Figure 4 are just a visual aid. We say that there is an edge between
subsets a and b of N if there is an edge with one node in a and the other node in b.
Then, the subgraph induced by nodes a ∪ b is said to be connected in a and b.
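
The connected components can be computed by dropping all arrows and searching the remaining undirected edges. The sketch below assumes a hypothetical encoding in which dashed and full lines are passed together as a list of node pairs; the node labels are invented for the example.

```python
def connected_components(nodes, undirected_edges):
    """Components of the graph that remains after all arrows are removed.

    `nodes` is a list of node labels; `undirected_edges` holds the dashed
    and full lines as pairs of nodes. Arrows are simply not passed in.
    """
    adjacent = {node: set() for node in nodes}
    for a, b in undirected_edges:
        adjacent[a].add(b)
        adjacent[b].add(a)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacent[node] - component)
        seen |= component
        components.append(frozenset(component))
    return components


# Hypothetical example: five nodes, two undirected edges, node 3 isolated.
print(connected_components([1, 2, 3, 4, 5], [(1, 2), (4, 5)]))
```

A node without any undirected edge forms a component of its own, which matches the single responses in a sequence of univariate regressions.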

For any one block of stacked boxes, different orderings are possible. We speak of a
compatible ordering if each arrow starting at a node in any $g_j$ points to a node in
$g_{<j} = g_1 \cup \dots \cup g_{j-1}$, but never to a node in $g_{>j} = g_{j+1} \cup \dots \cup g_J$, the past of $g_j$.
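
A compatibility check follows directly from this definition. The sketch below assumes a hypothetical encoding: the ordering is a list of components, first component first, and arrows are (parent, child) pairs; the node labels are invented.

```python
def is_compatible(ordering, arrows):
    """Check that every arrow starting in a component g_j ends in g_{<j}.

    `ordering` lists the connected components g_1, ..., g_J as sets of nodes;
    `arrows` holds (parent, child) pairs, the arrow pointing to the child.
    """
    position = {node: j for j, component in enumerate(ordering) for node in component}
    return all(position[child] < position[parent] for parent, child in arrows)


# Hypothetical ordering with a primary response, a joint response and a context node.
ordering = [{"y"}, {"z", "x"}, {"b"}]
print(is_compatible(ordering, {("z", "y"), ("b", "z"), ("b", "y")}))  # True
print(is_compatible(ordering, {("y", "z")}))                          # False
```

An arrow between two nodes of the same component also fails the check, which agrees with the requirement that nodes within a connected component are coupled only by undirected edges.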

Full lines are edges coupling context variables within v. Dashed lines couple joint 
responses within u. The regression graph is complete if every node pair is coupled. In 
this case, the statistical model is saturated as it is unconstrained for some given family 
of distributions. 

Let $g_1, \ldots, g_J$ denote any compatible ordering of the connected components of the
regression graph; then a corresponding joint density $f_N$ factorises as

$$f_N = \prod_{j=1}^{J} f_{g_j \mid g_{>j}} \qquad (1)$$

into sequences of regressions for the joint responses $g_j$ within u and for separate concen-
tration graph models in disconnected $g_j$ within v.

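In a generating process of $f_N$ over a regression graph, one starts with the
density of $g_J$, continues with the one of $g_{J-1}$ given $g_J$, up to the density of $g_1$ given $g_{>1}$,
so that (1) is used for one given compatible ordering of the node set N. Every ik-edge
present denotes a non-vanishing conditional dependence of $Y_i$ and $Y_k$ given some vector
variable $Y_c$, written as i ⋔ k | c, so that the graph is said to represent a dependence base
or to capture a dependence structure. The generating process attaches the following
meaning to each ik-edge present in the regression graph:

$$
\begin{array}{ll}
(i) & i \pitchfork k \mid g_{>j} \quad \text{for } i, k \text{ both in a response component } g_j \text{ of } u, \\
(ii) & i \pitchfork k \mid g_{>j} \setminus \{k\} \quad \text{for } i \text{ in a response component } g_j \text{ of } u \text{ and } k \text{ in } g_{>j}, \\
(iii) & i \pitchfork k \mid v \setminus \{i, k\} \quad \text{for } i, k \text{ both in a context component } g_j \text{ of } v.
\end{array}
\qquad (2)
$$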
Notice that only for context variables, conditioning is on all other context variables
while for responses conditioning is exclusively on variables in their past. When the
dependence sign ⋔ is replaced by the independence sign ⊥⊥, equations (2) give with
missing edges for node pairs i, k the pairwise independence statements defining
the independence structure of the regression graph, given the composition and the intersection
property discussed below.

An equivalent, more compact description of the set of defining pairwise indepen-
dences and a proof of equivalence of this pairwise Markov property to the global
Markov property have been given for the class of mixed loopless graphs, which contain re-
gression graphs as a subclass; see Sadeghi and Lauritzen (2011); see also Kang and Tian
(2009), Pearl and Paz (1987), Marchetti and Lupparelli (2011) for relevant, previous
results. A global Markov property permits one to read off the graph all independence
statements implied by the graph.

Equation (2)(i) holds for the conditional covariance graphs of joint responses gj
within u having dashed lines as edges, (2)(ii) for the dependences of the single responses
within gj on variables in the past of gj having arrows as edges and equation (2)(iii) for
the concentration graph of the context variables within v having full lines as edges.
For instance, from the definition of the missing edges corresponding to (2), one can
derive for Figure 1 S ⊥⊥ U | bc by (2)(ii), P ⊥⊥ Q | B by (2)(iii), and both A ⊥⊥ B | PQ and
A ⊥⊥ P | BQ by (2)(i) using first principles and the two special properties of the generated
distributions named composition and intersection.

Notice that each missing edge of a regression graph corresponds to an indepen-
dence statement for the uncoupled node pair; see also Lemma 2 and Lemma 3 below.
Therefore, regression graphs represent one special class of the so-called independence
graphs. Whenever a regression graph consists of two disconnected graphs, for
Ya and Yb say, since no path leads from a node in a to a node in b, and a ∪ b = N, then
a ⊥⊥ b or $f_N = f_a f_b$, and the two vector variables may be analysed separately. Therefore,
we treat in Section 7 of this paper only connected regression graphs.

All graphs discussed in this paper have no loops, that is no edge connects a node
to itself, and they have at most one edge between two different nodes. Recall that
an ik-path in such a graph can be described by a sequence of its nodes. By convention,
an ik-path without inner nodes is an edge. For every ik-edge, the endpoints differ, i ≠ k.
An ik-path with i = k has at least three nodes and is called a cycle.

A three-node path of arrows may contain only one of the three types of inner nodes
shown in Figure 5, called transition, source and sink node, respectively.

a) o <--- o <--- o      b) o <--- o ---> o      c) o ---> o <--- o

Figure 5: The three types of three-node paths in directed acyclic graphs with inner nodes
named a) transition, b) source, c) sink node (or in directed acyclic graphs: collision node).
A path is directed if all its inner nodes are transition nodes. In a directed cycle,
all edges are arrows pointing in the same direction and one returns to a starting node
following the direction of the arrows. A regression graph contains no directed cycle and
no semi-directed cycles, which have at least one undirected edge in an otherwise
directed cycle. If a directed ik-path starts at k and its arrows point towards i, then node
k has been named an ancestor of node i and node i a descendant of node k.
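
Ancestors can be collected by following arrows backwards from a node. The sketch below again assumes the hypothetical (parent, child) encoding of arrows and invented node labels.

```python
def ancestors(node, arrows):
    """Return all nodes from which a directed path of arrows leads to `node`."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in arrows:
            if child == current and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


# Hypothetical directed path c ---> b ---> a.
arrows = {("c", "b"), ("b", "a")}
print(sorted(ancestors("a", arrows)))  # ['b', 'c']
```

In a graph without directed cycles, a node is never among its own ancestors, so `node in ancestors(node, arrows)` can serve as a simple check for a directed cycle through that node.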

The subgraph induced by a subset a of the node set consists of the nodes 
within a and of the edges present in the graph within a. A special type of induced sub- 
graph, needed here, consisting of three nodes and two edges, is named a V -configuration 
or just a V. Thus, a three-node path forms a V if the induced subgraph has two edges. 

An ik-path is chordless if for each of its three consecutive nodes (h, j, k), coupled
by an hj-edge and a jk-edge, there is no additional hk-edge present in the graph. In
a chordless cycle of four or more nodes, the subgraph induced by every consecutive 
three nodes forms a V in the graph. An undirected graph is chordal if it contains 
no chordless cycle in four or more nodes. 
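
Chordality of an undirected graph can be checked without listing cycles. The sketch below uses a standard technique that is not taken from this paper: a graph is chordal exactly when its nodes can be removed one at a time, each removed node being simplicial, that is having neighbours that are all coupled to each other. The node labels are hypothetical.

```python
def is_chordal(nodes, edges):
    """Check chordality by repeatedly removing a simplicial node."""
    adjacent = {node: set() for node in nodes}
    for a, b in edges:
        adjacent[a].add(b)
        adjacent[b].add(a)

    remaining = set(nodes)
    while remaining:
        # A node is simplicial if every pair of its neighbours is coupled.
        simplicial = next(
            (n for n in remaining
             if all(b in adjacent[a] for a in adjacent[n] for b in adjacent[n] if a != b)),
            None,
        )
        if simplicial is None:
            return False  # some chordless cycle in four or more nodes remains
        for neighbour in adjacent[simplicial]:
            adjacent[neighbour].discard(simplicial)
        del adjacent[simplicial]
        remaining.remove(simplicial)
    return True


four_cycle = [(1, 2), (2, 3), (3, 4), (4, 1)]
print(is_chordal([1, 2, 3, 4], four_cycle))             # False
print(is_chordal([1, 2, 3, 4], four_cycle + [(1, 3)]))  # True
```

The quadratic search for a simplicial node keeps the sketch short; faster orderings exist but are not needed for graphs of the size discussed here.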

In regression graphs, there may occur the three types of collision Vs of Figure 6.

a) o - - - o - - - o      b) o ---> o <--- o      c) o - - - o <--- o

Figure 6: The three types of collision Vs in regression graphs: a) undirected, b) directed or
sink-oriented, c) semi-directed; for uncoupled path endpoints, the inner node is excluded from
every independence statement that the graph implies for these endpoints.

Notice that in a directed acyclic graph, the only possible collision V is directed and
coincides with the sink V of Figure 5c).

An important common feature of the three Vs of Figure 6 is that the inner node
is excluded from every independence statement for its endpoints; see (2) and Lemma
2. In all other five possible types of V-configurations of a regression graph, named
transmitting Vs, the inner node is instead included in the independence statement for
the endpoints; see (2) and Lemma 3 below. Notice that for uncoupled endpoints, both
paths a) and b) of Figure 5 are transmitting Vs. Similarly, the definition of transmitting
and collision nodes remains unchanged if the Vs in Figure 6 are interpreted as ik-paths
for which there may be an additional ik-edge present in the graph.

A collision path has as inner nodes exclusively collision nodes, while a trans- 
mitting path has as inner nodes exclusively transmitting nodes. A chordless collision 
path in four nodes contains at least one dashed line. In particular, it is impossible to 
replace all the edges in such a four-node path by arrows and not generate at least one 
transmitting V. Thereby, the meaning of this missing edge would be changed and hence 
contradict its unique definition given from the generating process. The skeleton of a 
graph results by replacing each edge present by a full line. Now, two of the main new 
results of this paper can be stated. 
Theorem 1. Two regression graphs are Markov equivalent if and only if they have the
same skeleton and the same sets of collision Vs, irrespective of the type of edge. 
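
The criterion of Theorem 1 can be sketched in a few lines. The encoding below is hypothetical: a graph is a dictionary with arrows as (parent, child) pairs and with dashed and full lines as sets of two-node frozensets. An inner node counts as a collision node when both edges of the V are dashed lines at that node or arrows pointing to it, following the three collision Vs described above. The sketch assumes that its inputs are valid regression graphs and does not verify this.

```python
from itertools import combinations


def skeleton(graph):
    """Set of coupled node pairs, irrespective of the type of edge."""
    return ({frozenset(pair) for pair in graph["arrows"]}
            | set(graph["dashed"]) | set(graph["full"]))


def collision_vs(graph):
    """All collision Vs, each as (uncoupled endpoint pair, inner node)."""
    skel = skeleton(graph)
    nodes = set().union(*skel)
    found = set()
    for inner in nodes:
        # Neighbours whose edge is an arrow into `inner` or a dashed line.
        colliding = {p for p, c in graph["arrows"] if c == inner}
        for edge in graph["dashed"]:
            if inner in edge:
                colliding |= edge - {inner}
        for i, k in combinations(sorted(colliding), 2):
            if frozenset((i, k)) not in skel:
                found.add((frozenset((i, k)), inner))
    return found


def markov_equivalent(g1, g2):
    """Same skeleton and same sets of collision Vs."""
    return skeleton(g1) == skeleton(g2) and collision_vs(g1) == collision_vs(g2)


# Hypothetical three-node graphs on the skeleton a - b - c.
chain = {"arrows": {("c", "b"), ("b", "a")}, "dashed": set(), "full": set()}
reversed_chain = {"arrows": {("a", "b"), ("b", "c")}, "dashed": set(), "full": set()}
sink = {"arrows": {("a", "b"), ("c", "b")}, "dashed": set(), "full": set()}
dashed_v = {"arrows": set(), "full": set(),
            "dashed": {frozenset(("a", "b")), frozenset(("b", "c"))}}

print(markov_equivalent(chain, reversed_chain))  # True
print(markov_equivalent(chain, sink))            # False
print(markov_equivalent(sink, dashed_v))         # True
```

The last comparison illustrates the phrase "irrespective of the type of edge": the directed collision V and the undirected collision V on the same three nodes count as the same V.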



Theorem 2. A regression graph with a chordal graph for the context variables can be
oriented to be Markov equivalent to a directed acyclic graph in the same skeleton, if and
only if it does not contain any chordless collision path in four nodes.

Sequences of regressions were introduced and studied, without specifying a concen-
tration graph model for the context variables, by Cox and Wermuth (1993) and Wermuth and Cox
(2004), under the name of multivariate regression chains, reminding one of the sequences
of unconstrained models that the class contains for Gaussian joint responses. An exten-
sion to graphs including a concentration graph had already been proposed for directed
acyclic graphs by Kiiveri, Speed and Carlin (1984). By this type of extension, the global
Markov property of the graph remains unchanged.



A criterion for Markov equivalence of summary graphs has been derived by Sadeghi
(2009) who also shows that two different criteria for maximal ancestral graphs are
equivalent, those due to Zhao, Zheng and Liu (2005) and to Ali, Richardson and Spirtes
(2009). These available Markov equivalence results and the associated proofs increase
considerably in complexity, the larger the model class. On the other hand, the Markov
equivalence criterion of Theorem 1 is simple and includes as special cases all available
equivalence results for directed acyclic graphs, for covariance graphs and for concentra-
tion graphs, as set out in detail in Sections 6 and 7 here.

For context variables taken as given, Gaussian regression graph models coincide with 
a large subclass of structural equation models (SEMs), those permitting local modeling 
due to the factorisation property (1) and they are without any endogenous responses.
Such responses have residuals that are correlated with some of their regressors so that the
so-called endogeneity problem is generated, by which, for joint Gaussian distributions, a 
zero equation parameter need not correspond to any conditional independence statement 
and a nonzero equation parameter is not a measure of conditional dependence. The 
consequence is that ordinary least squares estimates of such equation parameters are
typically strongly distorted. This was recognized by Haavelmo (1943) who received a
Nobel prize in economics for this insight in 1989; see also Jöreskog (1981), Bollen (1989),
Kline (2006), while Pearl (2009) advocates SEMs as a framework for causal inquiries. In the
econometric literature forty years ago, independences were always regarded as
'overidentifying' constraints.

For discrete variables, more attractive features of regression graph models were de-
rived by Drton (2009), who speaks of chain graph models of type IV for multivariate
regression chains in the case all variables on equal standing have covariance graphs.
He proves that each member in this class belongs to a curved exponential family; for
a discussion of this notion see, for instance, Cox (2006), Section 6.8. Discrete type
IV models form also a subclass of marginal models; see Rudas, Bergsma and Németh
and Bergsma and Rudas. Local independence statements that involve only
variables in the past are equivalent to more complex local independences used by Drton
(2009); see Marchetti and Lupparelli (2011). These local definitions imply the pair-
wise independence formulation for missing edges corresponding to equation (2) for any
regression graph.

Two other types of chain graph have been studied as joint response models in statis-
tics, the so-called AMP chain graphs of Andersson, Madigan and Perlman (2001)
and the LWF chain graphs of Lauritzen and Wermuth (1989) and Frydenberg (1990).
They use the same factorisation as in equation (1), but they are suitable for modeling
data from intervention studies only when they are Markov equivalent to a regression
graph. The reason is that the conditioning set for pairwise independences of responses
includes in general other nodes within the same connected component. For AMP graphs,
the independence form of equation (2)(i) is replaced by

(i') $i \perp\!\!\!\perp k \mid g_{>j-1} \setminus \{i, k\}$ for i, k both within a response component $g_j$,

while (2)(ii) and (2)(iii) remain unchanged. For LWF graphs, equation (2)(i) is also replaced by
(i') and the independence form of (ii) by

$i \perp\!\!\!\perp k \mid g_{>j-1} \setminus \{i, k\}$ for i within a $g_j$ and k in $g_{>j}$.

As a consequence, each undirected subgraph in an AMP chain graph is a concentration
graph, and an LWF chain graph consists of sequences of concentration graphs. For
the corresponding different types of parametrisations of joint Gaussian distributions see
Wermuth, Wiedenbeck and Cox (2006).

Not yet systematically approached is the search for covering models that capture
most but not all independences in a more complex graph but which may be easier
to fit than the reduced model; see Cox and Wermuth (1990). For regression graphs,
details are explained here for a small example in Section 4, and in Section 7, first results
are given in several propositions, the last being Proposition 10, and discussed using
Figures 16 and 17.

Before we turn to the different types of missing edges in more detail, we derive a
well-fitting regression graph for data given by Kappesser (1997).



3 Deriving and interpreting a regression graph 

For 201 chronic pain patients, the role of the site of pain during a three week stay 
in a chronic pain clinic was to be examined. In this study, it was of main interest to 
investigate the changes in two main symptoms before and after stationary treatment and 
to understand determinants of the overall treatment success as rated by the patients, 
three months after they had left the clinic. Figure 7 shows a first ordering of the variables
derived in discussions between psychologists, physicians and statisticians. 
The first ordering of the variables gives for each single or joint response a list of its
possible explanatory variables, shown in boxes to the right, but in Figure 7 only those
variables are displayed that remained after statistical analyses relevant for the responses
of main interest.

Selecting for each response all its directly explanatory variables from this list and 
checking for remaining dependences among components of joint responses, provides 
enough insight to derive a well-fitting regression graph model. With this type of local 
modeling, the reasons for the model choice are made transparent. 

Of the available background variables, age, gender, marital status and others, only
the binary variable, level of formal schooling (1:=less than ten years, 2:=ten or more
years), and the number of previous illnesses in years (min:=0, max:=16) are displayed
in the far right box as the relevant context variables. The response of primary interest,
self-reported success of treatment, is listed in the box to the far left. It is a score that
ranges between 0 and 35, combining a patient's answers to a specific questionnaire.



[Boxes of Figure 7, from left to right: Y, success of treatment (primary response); after
treatment: Za, intensity of pain, and Xa, depression; before treatment: intensity of pain and
depression; U, chronicity of pain; A, site of pain; B, level of formal schooling, and V, number
of previous illnesses (context variables).]

Figure 7: First ordering of variables in the chronic pain study. There are two joint responses,
intensity of pain and depression. They are the main symptoms of chronic pain, measured here
before and after treatment. The components of each response are to be modeled conditionally
given the variables listed in boxes to their right.

There are a number of intermediate variables. These are both explanatory for some 
variables and responses to others. Of these, two are regarded as joint responses since 
they represent two symptoms of a patient, intensity of pain and depression. Both are 
measured before treatment and directly after the three-week stationary stay. Ques-
tionnaire scores are available for depression (min:=0, max:=46) and for the self-reported
intensity of pain (min:=0, max:=10). Chronicity of pain is a score (min:=0, max:=8)
that incorporates different aspects, such as the frequency and duration of pain attacks, 
the spreading of pain and the use of pain relievers. In this study, the patients have one 
of two main sites of pain, the pain is either on their upper body, 'head, face, or neck' or 
on their 'back'.

A well-fitting regression graph is shown in Figure 8. The graph summarizes some
important aspects of the results of the statistical analyses for which details are given 
in the Appendix. In particular, it tells which of the variables are directly explanatory, 
that is which are important for generating and predicting a response, by showing arrows 
that start from each of these directly explanatory variables and point to the response. 



Figure 8: Regression graph, well compatible with the data, that results from the reported
statistical analyses. Discrete variables are drawn as dots, continuous ones as circles.

Variables listed to the right of a response but without an arrow ending at this re- 
sponse do not substantially improve the prediction of the response when used in addition 
to the directly explanatory variables. For instance, for treatment success, only the pain 
intensity after the clinic stay is directly explanatory and this pain intensity is an impor- 
tant mediator (intermediate variable) between treatment success and site of pain. 

[Figure 9 (plot not reproduced). Vertical axis: Y, success of treatment; horizontal axis: $Z_a$, intensity of pain after treatment.]

Figure 9: Form of dependence of primary response $Y$ on $Z_a$.

Scores of self-reported treatment success are low for almost all patients with high
pain scores after treatment, that is, for scores higher than 6; see Figure 9. Otherwise,
treatment success is typically judged to be higher the lower the intensity of pain after
treatment. This explains the nonlinear dependence of $Y$ on $Z_a$.

As mentioned before, for back pain patients, the chronicity scores are on average
higher than for head-ache patients, and a higher chronicity of the pain goes together
with higher scores of depression. These patients may possibly have tried too late, after
the acute pain had started, to get well-focused help. Both before and after treatment,
highly depressed patients tend to report higher intensities of pain than others.

The study provides no information on which variables may explain these depen- 
dences between the symptoms that remain after having taken the available explanatory 
variables into account. However, hidden common explanatory variables may exist in 
both cases since these remaining dependences between the symptoms do not depend 
systematically on any other observed variable. 

Some variables are indirectly explanatory. An arrow starts from an indirectly 
explanatory variable, and points via a sequence of arrows and intermediate variables to 
the response variable. For instance, the level of formal schooling and the site of pain 
are both indirectly explanatory for each of the symptoms after treatment and for the 
overall treatment success. 

Once the types and directions of the direct dependence are taken into account, 
the regression graph helps to trace the development of chronic pain, starting from the 
context information on the level of schooling and the number of previous illnesses of a 
patient. Thus, patients with more years of formal schooling are more likely to be chronic 
head-ache patients. Patients with a lower level of formal schooling are more likely to 
be back-ache patients, possibly because more of them have jobs involving hard physical 
work. Back-ache patients reach higher stages of the chronicity of pain and report higher 
intensity of pain still after treatment and are therefore typically less satisfied with the 
treatment they had received. 

Graphical screening for nonlinear relations and interactive effects (Cox and 
Wermuth, 1994) pointed to the nonlinear dependence of treatment success on intensity 
of pain after treatment but to no other such relations. The regression graph model is said 
to fit the data well because for each single response separately, there is no indication 
that adding a further variable would substantially change the generated conditional 
dependences. The seemingly unrelated dependences of the symptoms after treatment 
on those before treatment agree so well with the observations that they differ also little 
from regressions computed separately, see the appropriate tables in the Appendix. 

Had there been no nonlinear relation and no categorical variables as responses, the 
overall model fit could also have been tested within the framework of structural equation 
models once the regression graph is available. This graph is derived here with the local 
modeling steps that use the first ordering of the variables, just in terms of univariate, 
multivariate and seemingly unrelated regressions. The regression graph provides a
hypothesis that may be tested locally and/or globally in future studies that include the
same set of nine variables. In this case, no variable selection strategy would be used or
needed.

The available results for changes of the regression graph (Wermuth, 2011) that result 
after marginalising and conditioning provide a solid basis for comparing the results of 
any sequence of regressions with studies that contain the same set of core variables but 
which have some of the variables omitted or which consider subpopulations, defined by 
levels or level combinations of other variables. For instance for comparisons with the 
current study, the same chronicity score may not be recorded in another pain clinic or 
data may be available only for patients with pain in the upper body. 

The main substantive results of this empirical study are that site of pain
needs to be taken into account also in future studies since it is an important mediator
between the intrinsic characteristics of a patient, measured here by the given context
variables, and both the overall treatment success and the symptoms after treatment.
For back-ache patients, the chronicity of pain and the depression score are higher than
for the head-ache patients and the treatment is less successful since the intensity of pain
remains high after the treatment in the clinic.

In the following section we give three-variable examples of a Gaussian joint response
regression and of the three subclasses of regression graphs that have only one type of 
edge, of the covariance, the concentration and the directed acyclic graph to discuss 
the different types of conditional dependences and the possible types of independence 
constraints associated with the corresponding regression graphs. 



4 Regressions, dependences and recursive relations 

For a quantitative response with linear dependences, the simple regression model dates 
back at least several centuries. The fitting of a least-squares regression line had been
developed separately by Carl Friedrich Gauss (1777-1855), Adrien-Marie Legendre
(1752-1833) and Robert Adrain (1775-1843). The method extends directly to models with
several explanatory variables.

The most studied regression models are for joint Gaussian distributions. Regression 
graphs mimic important features of these linear models but represent also relations 
in other distributions of continuous and discrete variables, which permit in particular 
nonlinear and interactive dependences. In a regular joint Gaussian distribution, let the
mean-centered vector variable $Y$ have dimension three. We then write the covariance
matrix, $\Sigma$, and the concentration matrix, $\Sigma^{-1}$, with graphs shown in Figure 10, as

\[
\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ . & \sigma_{22} & \sigma_{23} \\ . & . & \sigma_{33} \end{pmatrix},
\qquad
\Sigma^{-1} = \begin{pmatrix} \sigma^{11} & \sigma^{12} & \sigma^{13} \\ . & \sigma^{22} & \sigma^{23} \\ . & . & \sigma^{33} \end{pmatrix},
\]

where the dot-notation indicates entries in a symmetric matrix.

Figure 10: For unconstrained trivariate Gaussian distributions, the parameters attached to
the edges are those corresponding to a) a covariance graph, b) a concentration graph.

With the edge of node pair (1, 2) removed, both graphs turn into a V but have
different interpretations. The resulting independence constraints are for Figures 10a)
and b), respectively,

\[ 1 \perp\!\!\!\perp 2 \;\; (\sigma_{12} = 0) \quad \text{and} \quad 1 \perp\!\!\!\perp 2 \mid 3 \;\; (\sigma^{12} = 0), \]

where the latter derives from an important property of concentration matrices; for proofs
see Cox and Wermuth (1996), Section 3.4 or Wermuth, Cox and Marchetti (2006),
Section 2.3. For other distributions, the independence interpretation of these two types of
undirected graph remains unchanged, but not the parametrisation. A similar statement
holds for directed acyclic graphs and, more generally, for regression graphs.

For the linear equations that lead to a complete directed acyclic graph for a trivariate
Gaussian distribution with mean zero, one starts with three mutually independent
Gaussian residuals $\varepsilon_i$ and takes the following system of equations, in which for instance
$\beta_{1|3.2}$ is a regression coefficient for the dependence of response $Y_1$ on $Y_3$ when $Y_2$ is an
additional regressor. Because of the form of the equations, one speaks of triangular
systems also when the distribution of the residuals is not Gaussian, but the residuals
are just uncorrelated, or expressed equivalently, if each residual is uncorrelated with the
regressors in its equation:

\[
\begin{aligned}
Y_1 &= \beta_{1|2.3} Y_2 + \beta_{1|3.2} Y_3 + \varepsilon_1, \\
Y_2 &= \beta_{2|3} Y_3 + \varepsilon_2, \\
Y_3 &= \varepsilon_3.
\end{aligned}
\tag{3}
\]

When the residuals do not follow Gaussian distributions, the probabilistic independence 
interpretation is lost, but the lack of a linear relation can be inferred with any vanishing 
regression coefficient. 
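
To make the role of the equation parameters concrete, the following Python sketch computes the covariance matrix implied by a triangular system of the form (3) and checks what a vanishing coefficient implies. The function names, argument names and numerical values are illustrative assumptions and not part of the reported analyses; exact rational arithmetic is used so that zeros are exact.

```python
from fractions import Fraction as F

def implied_covariance(b1_23, b1_32, b2_3, d1, d2, d3):
    """Covariance matrix of (Y1, Y2, Y3) implied by the triangular system
        Y1 = b1_23*Y2 + b1_32*Y3 + e1,  Y2 = b2_3*Y3 + e2,  Y3 = e3,
    with uncorrelated residuals e1, e2, e3 of variances d1, d2, d3."""
    s33 = d3
    s23 = b2_3 * s33
    s22 = b2_3 * s23 + d2
    s13 = b1_23 * s23 + b1_32 * s33
    s12 = b1_23 * s22 + b1_32 * s23
    s11 = b1_23 * s12 + b1_32 * s13 + d1
    return [[s11, s12, s13], [s12, s22, s23], [s13, s23, s33]]

def cov_12_given_3(s):
    """Covariance of Y1, Y2 given Y3, computed as s12 - s13*s23/s33."""
    return s[0][1] - s[0][2] * s[1][2] / s[2][2]

# Hypothetical parameter values: the coefficient of Y2 in the equation of Y1 is zero.
sigma = implied_covariance(F(0), F(1, 2), F(1, 2), F(1), F(1), F(1))
print(sigma[0][1])            # marginal covariance of Y1, Y2: 1/4
print(cov_12_given_3(sigma))  # covariance of Y1, Y2 given Y3: 0
```

In this example the coefficient of $Y_2$ in the first equation is zero, so the covariance of $Y_1, Y_2$ given $Y_3$ vanishes although their marginal covariance does not.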

In econometrics, Hermann Wold (1908-1992) introduced such systems as linear
recursive equations with uncorrelated residuals. Harald Cramér (1893-1985) used the
term linear least-squares equations for residuals in a population being uncorrelated with
the regressors, and the notation for the regression coefficients is an adaptation of the one
introduced by Udny Yule (1871-1951) and William Cochran (1909-1980).

In joint Gaussian distributions, independence constraints on triangular systems mean
vanishing equation parameters and missing edges in directed acyclic graphs, such as

\[ 1 \perp\!\!\!\perp 2 \mid 3 \;\; (\beta_{1|2.3} = 0) \quad \text{and} \quad 2 \perp\!\!\!\perp 3 \;\; (\beta_{2|3} = 0). \]

The complete directed acyclic graph defined implicitly with equations (3) is displayed
in Figure 11a).



[Figure 11 (graphs not reproduced). Recoverable edge labels in a), on nodes 1, 2, 3: $\beta_{1|2.3}$, $\beta_{2|3}$, $\beta_{1|3.2}$.]

Figure 11: Parameters of a Gaussian distribution in: a) a complete directed acyclic graph,
b) a complete regression graph.

For the smallest joint response model with the complete graph shown in Figure 11b),
we take both Gaussian variables $Y_1$ and $Y_2$ to depend on a Gaussian variable $Y_3$, to get
equations (4) with residuals having zero means and being uncorrelated with $Y_3$:

\[ Y_1 = \beta_{1|3} Y_3 + u_1, \qquad Y_2 = \beta_{2|3} Y_3 + u_2, \qquad Y_3 = u_3. \tag{4} \]



Here, $\sigma_{12|3} = E(u_1 u_2)$. The generating processes, and hence the interpretations, differ for
the two models in equations (3) and (4). In the corresponding graphs of Figures 11a) and
11b), the vanishing of the edges for pairs (1,2) and (2,3) means the same independence
constraints since

\[ 1 \perp\!\!\!\perp 2 \mid 3 \iff (\sigma_{12|3} = 0) \iff (\beta_{1|2.3} = 0) \quad \text{and} \quad 2 \perp\!\!\!\perp 3 \iff (\beta_{2|3} = 0), \]

but the edges for pair (1,3) capture different dependences, $1 \pitchfork 3 \mid 2$ in Figure 11a) and
$1 \pitchfork 3$ in Figure 11b). Again, taking away any edge generates a V. Taking away any two
edges means to combine two independence statements. This is discussed further in the next section.

One of the special important features of the linear least-squares regressions is that 
the residuals are uncorrelated with the regressors. The effect is that the model part 
coincides with a conditional linear expectation as illustrated here with a model for 
response $Y_1$ and regressors $Y_2, Y_3$, which we take, as mentioned before, as measured in
deviations from their means. For instance, one gets for

\[ Y_1 = \beta_{1|2.3} Y_2 + \beta_{1|3.2} Y_3 + \varepsilon_1: \qquad E_{\mathrm{lin}}(Y_1 \mid Y_2, Y_3) = \beta_{1|2.3} Y_2 + \beta_{1|3.2} Y_3. \tag{5} \]



There is a recursive relation for least-squares regression coefficients; see Cochran
(1938), Cox and Wermuth (2003), Ma, Xie and Geng (2006). It shows for instance with

\[ \beta_{1|3} = \beta_{1|3.2} + \beta_{1|2.3}\,\beta_{2|3} \tag{6} \]

that $\beta_{1|3.2}$, the partial coefficient of $Y_3$ given also $Y_2$ as a regressor for $Y_1$, coincides with
the marginal coefficient, $\beta_{1|3}$, if and only if $\beta_{1|2.3} = 0$ or $\beta_{2|3} = 0$.
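
The recursive relation (6) can be checked numerically. The sketch below computes population least-squares coefficients from a covariance matrix and compares both sides of (6). The function name `least_squares_coefficients`, the dictionary keys and the covariance matrix are hypothetical choices made only for illustration.

```python
from fractions import Fraction as F

def least_squares_coefficients(sigma):
    """Population least-squares coefficients from a 3x3 covariance matrix.

    Returns the marginal coefficients b1_3 (Y1 on Y3) and b2_3 (Y2 on Y3) and the
    partial coefficients b1_23 (of Y2) and b1_32 (of Y3) for Y1 regressed on Y2, Y3."""
    s12, s13 = sigma[0][1], sigma[0][2]
    s22, s23, s33 = sigma[1][1], sigma[1][2], sigma[2][2]
    det = s22 * s33 - s23 * s23  # determinant of the covariance matrix of (Y2, Y3)
    return {
        "b1_3": s13 / s33,
        "b2_3": s23 / s33,
        "b1_23": (s12 * s33 - s13 * s23) / det,
        "b1_32": (s13 * s22 - s12 * s23) / det,
    }

# Hypothetical positive definite covariance matrix of (Y1, Y2, Y3).
sigma = [[F(4), F(2), F(1)], [F(2), F(3), F(1)], [F(1), F(1), F(2)]]
b = least_squares_coefficients(sigma)
lhs = b["b1_3"]
rhs = b["b1_32"] + b["b1_23"] * b["b2_3"]
print(lhs, rhs)  # 1/2 1/2
```

Here the partial coefficient 1/5 differs from the marginal coefficient 1/2 because neither $\beta_{1|2.3}$ nor $\beta_{2|3}$ vanishes.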

The method of maximizing the likelihood was recommended by Sir Ronald Fisher 
(1890-1962) as a general estimation technique that applies also to regressions with 
categorical or quantitative responses. One of the most attractive features of the method 
concerns properties of the estimates. Given two models with parameters that are in one- 
to-one correspondence, the same one-to-one transformation leads from the maximum- 
likelihood estimates under one model to those of the other. 

Different single response regressions, such as logistic, probit, or linear regressions,
were described as special cases of the generalized linear model by Nelder and Wedderburn
(1972); see also McCullagh and Nelder (1989). In all of these regressions, the vanishing
of the coefficient(s) of a regressor indicates conditional independence of the response
given all directly explanatory variables for this response.

The general linear model with a vector response, also called multivariate linear
regression, has identical sets of regressors for each component variable of a response vector
variable. Maximum-likelihood estimation of regression coefficients for a joint Gaussian
distribution reduces to linear least-squares fitting for each component separately; see
Anderson (1958), Chapter 8.

With different sets of regressors for the components of a vector response, seemingly
unrelated regressions (SUR) result and iterative methods are needed for estimation; see
Zellner (1962). For small sample sizes, a given solution of the likelihood equations of a
Gaussian SUR model may not be unique; see Drton and Richardson (2004), Sundberg
(2010), while for exclusively discrete variables this will never happen; see Drton (2009).
For mixed variables, no corresponding results are available yet.

In general, there often exists a covering model with nice estimation properties.
For instance, one of the above described Gaussian SUR models that requires
iterative fitting has regression graph

[Graph not reproduced.]



A generating process starts with independent explanatory variables, each of which re- 
lates only to one of the two response components, but these are correlated given both 
regressors. There is a simple covering model, in which two missing arrows are added to 
the graph to obtain a general linear model. In that case, the new graph does not provide 
a dependence base, but closed form maximum-likelihood estimates are available. 

For a vector variable of categorical responses only, the multivariate logistic regression
of Glonek and McCullagh (1995) reduces to separate main effect logistic regressions for
each component of the response vector provided that certain higher-order interactions
vanish; see Marchetti and Lupparelli (2011). In the context of structural equation models
(SEMs), dependences of binary categorical variables are modeled in terms of probit
regressions. These do not differ substantially from logistic regressions whenever the
smallest and largest events occur at least with probability 0.1; see Cox (1966).

Multivariate linear regressions as well as SUR models belong to the framework of
SEMs even though this general class had been developed in econometrics to deal
appropriately with endogenous responses. Estimation methods for SEMs were discussed in
the Berkeley symposia on mathematical statistics and probability from 1945 to 1965, but
some identification issues have been settled only recently; see Foygel, Draisma and Drton
(2011) and for relevant previous results Brito and Pearl (2002), Stanghellini and Wermuth
(2005).

In statistical models that treat all variables on equal standing, the variables are not 
assigned roles of responses or regressors and undirected measures of dependence are used 
instead of coefficients of directed dependence. In the concentration graph models, the 
undirected dependences are conditional given all remaining variables on equal standing. 



For instance, for categorical variables, these models are better known as graphical
log-linear models; see [...], Wermuth (1976a), [...] (1970), Bishop, Fienberg and Holland [...],
Darroch, Lauritzen and Speed (1980). For Gaussian random variables, these had been
introduced as covariance selection models; see Dempster (1972), Wermuth (1976b),
Speed and Kiiveri (1986), Drton and Perlman (2004), and for mixed variables as graphical
models for conditional Gaussian (CG) distributions; see Lauritzen and Wermuth (1989),
Edwards (2000).



For a mean-centered vector variable $Y$, the elements of the covariance matrix $\Sigma$ are
$\sigma_{ij} = E(Y_i Y_j)$. If $\Sigma$ is invertible, the covariances $\sigma_{ij}$ are in a one-to-one relation with the
concentrations $\sigma^{ij}$, the elements of the concentration matrix $\Sigma^{-1}$. There is a recursive
relation for concentrations; see Dempster (1969). For a trivariate distribution

\[ \sigma^{23.1} = \sigma^{23} - \sigma^{12}\sigma^{13}/\sigma^{11}, \tag{7} \]

where $\sigma^{23.1}$ denotes the concentration of $Y_2, Y_3$ in their bivariate marginal distribution.
Thus, the overall concentration $\sigma^{23}$ coincides with $\sigma^{23.1}$ if and only if $\sigma^{12} = 0$ or $\sigma^{13} = 0$.

Alternatively in covariance graph models, the undirected measures for variables
on equal standing are pairwise marginal dependences. For Gaussian variables, these
models had been introduced as hypotheses linear in covariances; see Anderson (1973),
Kauermann (1996), Kiiveri (1987), Wermuth, Cox and Marchetti (2006), Chaudhuri,
Drton and Richardson (2007). For categorical variables, covariance graph models have
been studied only more recently; see Drton and Richardson (2008a), Lupparelli,
Marchetti and Bergsma (2009). Again, no similar estimation results are available for
general mixed variables yet.

There is also a recursive relation for covariances; see Anderson (1958), Section 2.5.
For just three components of $Y$ having a Gaussian distribution, it gives

\[ \sigma_{12|3} = \sigma_{12} - \sigma_{13}\sigma_{23}/\sigma_{33}, \tag{8} \]

where $\sigma_{12|3}$ denotes the covariance of $Y_1, Y_2$ given $Y_3$. Therefore, $\sigma_{12|3}$ coincides with $\sigma_{12}$
if and only if $\sigma_{13} = 0$ or $\sigma_{23} = 0$. By equations (6), (7), (8), a unique independence
statement is associated with the endpoints of any V in a trivariate Gaussian distribution.
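
The recursive relations (7) and (8) can be verified on a small example. The following sketch, using a hypothetical covariance matrix and helper names chosen only for illustration, compares each recursive expression with the quantity obtained directly: the concentration of $Y_2, Y_3$ from inverting their marginal covariance matrix, and the conditional covariance of $Y_1, Y_2$ given $Y_3$ from inverting the corresponding block of the concentration matrix.

```python
from fractions import Fraction as F

def inverse_3x3(s):
    """Inverse of a 3x3 matrix (nested lists) via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = s
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[x / det for x in row] for row in adj]

def inverse_2x2(m):
    """Inverse of a 2x2 matrix (nested lists)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical positive definite covariance matrix of (Y1, Y2, Y3).
sigma = [[F(4), F(2), F(1)], [F(2), F(3), F(1)], [F(1), F(1), F(2)]]
conc = inverse_3x3(sigma)

# Relation (7): concentration of Y2, Y3 after marginalising over Y1.
direct_7 = inverse_2x2([[sigma[1][1], sigma[1][2]], [sigma[2][1], sigma[2][2]]])[0][1]
recursive_7 = conc[1][2] - conc[0][1] * conc[0][2] / conc[0][0]

# Relation (8): covariance of Y1, Y2 after conditioning on Y3.
direct_8 = inverse_2x2([[conc[0][0], conc[0][1]], [conc[1][0], conc[1][1]]])[0][1]
recursive_8 = sigma[0][1] - sigma[0][2] * sigma[1][2] / sigma[2][2]

print(direct_7, recursive_7)  # -1/5 -1/5
print(direct_8, recursive_8)  # 3/2 3/2
```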

In the context of multivariate exponential families of distributions, concentrations
are special canonical parameters and covariances are special moment parameters with
estimates of canonical and moment parameters being asymptotically independent; see
Barndorff-Nielsen (1978), page 122. Regression graphs capture independence structures
for more general types of distribution, where operators for transforming graphs
mimic operators for transforming different parametrisations of joint Gaussian
distributions; see Wermuth, Wiedenbeck and Cox (2006), Wiedenbeck and Wermuth (2010),
Wermuth (2011).



In particular, by removing an edge from any V of a regression graph, one introduces 
an additional independence constraint just as in a regular joint Gaussian distribution. 
For this, the generated distributions have to satisfy the composition and intersection 
property in addition to the general properties, as discussed in the next section. 



5 Using graphs to combine independence statements 



We now state the four standard properties of independences of any multivariate
distribution; see e.g. Dawid (1979), Studený (2005), as well as two special properties of joint
Gaussian distributions. The six, taken together, describe the combination and
decomposition of independences in regression graphs, for instance those resulting from removing
edges. We discuss when these six properties apply also to regression graph models.

Let $X, Y, Z$ be random (vector) variables, continuous, discrete or mixed. By using
the same compact notation, $f_{XYZ}$, for a given joint density, a probability distribution or
a mixture and by denoting the union of say $X$ and $Y$ by $XY$, one has

\[ X \perp\!\!\!\perp Y \mid Z \iff (f_{XYZ} = f_{XZ} f_{YZ} / f_Z), \tag{9} \]

where for instance $f_Z$ denotes the marginal density or probability distribution of $Z$.
Since the order of listing variables for a given density is irrelevant, symmetry of
conditional independence is one of the standard properties, that is

\[ (i) \quad X \perp\!\!\!\perp Y \mid Z \iff Y \perp\!\!\!\perp X \mid Z. \]

Equation (9) restated for instance for the conditional distribution of $X$ given $Y$ and $Z$,
$f_{X|YZ} = f_{XYZ}/f_{YZ}$, is

\[ X \perp\!\!\!\perp Y \mid Z \iff (f_{X|YZ} = f_{X|Z}). \tag{10} \]

When two edges are removed from a graph in Figures 10 and 11, just one coupled
pair remains, suggesting that the single node is independent of the pair.

For instance in Figure 11a), with nodes 1, 2, 3 corresponding in this order to $X, Y, Z$,
removing the arrows for (1,2) and (2,3) leaves (1,3) disconnected from node 2. For any
joint density, implicitly generated as $f_{XYZ} = f_{X|YZ} f_{Y|Z} f_Z$, one has equivalently,

\[ (X \perp\!\!\!\perp Y \mid Z \text{ and } Y \perp\!\!\!\perp Z) \iff XZ \perp\!\!\!\perp Y. \]

In general, the contraction property is for $a, b, c, d$ disjoint subsets of $N$:

\[ (ii) \quad (a \perp\!\!\!\perp b \mid cd \text{ and } b \perp\!\!\!\perp c \mid d) \iff ac \perp\!\!\!\perp b \mid d. \]

It has become common to say that a distribution is generated over a given $G^N_{\mathrm{dag}}$
if the distribution factorizes as specified by the graph for any compatible ordering.
For instance, for a trivariate distribution generated over the collision V of Figure 11b)
obtained by removing the edge for (2,3), both orders (1, 2, 3) and (1, 3, 2) are compatible
with the graph and $f_{XYZ} = f_{X|YZ} f_Y f_Z$.

Conversely, suppose that $XZ \perp\!\!\!\perp Y$ holds, then this implies $X \perp\!\!\!\perp Y$ and $Z \perp\!\!\!\perp Y$ so
that for instance the same two edges as in Figure 11b) are missing in the corresponding
covariance graph of Figure 10a). In general, the decomposition property is for $a, b, c, d$
disjoint subsets of $N$:

\[ (iii) \quad a \perp\!\!\!\perp bc \mid d \implies (a \perp\!\!\!\perp b \mid d \text{ and } a \perp\!\!\!\perp c \mid d). \]

In addition, $XZ \perp\!\!\!\perp Y$ implies $X \perp\!\!\!\perp Y \mid Z$ and $Z \perp\!\!\!\perp Y \mid X$ so that for instance the same
two edges as in Figure 11b) are missing in the corresponding concentration graph of
Figure 10b). In general, the weak union property is for $a, b, c, d$ disjoint subsets of
$N$:

\[ (iv) \quad a \perp\!\!\!\perp bc \mid d \implies (a \perp\!\!\!\perp b \mid cd \text{ and } a \perp\!\!\!\perp c \mid bd). \]

Under some regularity conditions, all joint distributions share the four properties (i) to
(iv).

Joint distributions for which the reverse implications of the decomposition
property (iii) and of the weak union property (iv) hold, such as a regular joint Gaussian
distribution, are said to have, respectively, the composition property (v) and the
intersection property (vi), that is for $a, b, c, d$ disjoint subsets of $N$:

\[ (v) \quad (a \perp\!\!\!\perp b \mid d \text{ and } a \perp\!\!\!\perp c \mid d) \implies a \perp\!\!\!\perp bc \mid d, \]

\[ (vi) \quad (a \perp\!\!\!\perp b \mid cd \text{ and } a \perp\!\!\!\perp c \mid bd) \implies a \perp\!\!\!\perp bc \mid d. \]
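
That properties (v) and (vi) are genuinely special can be seen from two standard discrete counterexamples. The sketch below tests independence statements through the factorization criterion of equation (9) and applies it to a binary exclusive-or distribution, which violates composition, and to three copies of one fair coin, which violate intersection. The helper names `marginal` and `indep` and both example distributions are illustrative choices that do not come from the source.

```python
from fractions import Fraction as F
from itertools import product

def marginal(f, idx):
    """Marginal table of the joint table f over the coordinate positions in idx."""
    out = {}
    for cell, p in f.items():
        key = tuple(cell[i] for i in idx)
        out[key] = out.get(key, 0) + p
    return out

def indep(f, a, b, c=()):
    """Check a _||_ b | c through f_abc * f_c == f_ac * f_bc in every cell.

    f must list every cell of the table, including cells of probability zero."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    f_abc, f_c = marginal(f, a + b + c), marginal(f, c)
    f_ac, f_bc = marginal(f, a + c), marginal(f, b + c)
    na, nb = len(a), len(b)
    for cell, p in f_abc.items():
        xa, xb, xc = cell[:na], cell[na:na + nb], cell[na + nb:]
        if p * f_c[xc] != f_ac[xa + xc] * f_bc[xb + xc]:
            return False
    return True

cells = list(product((0, 1), repeat=3))
# Positions 0, 1, 2 hold X, Y, Z; X is the exclusive-or of independent fair coins Y, Z.
xor = {(x, y, z): F(1, 4) if x == (y ^ z) else F(0) for x, y, z in cells}
# X, Y, Z are three copies of one fair coin (not a positive distribution).
copies = {(x, y, z): F(1, 2) if x == y == z else F(0) for x, y, z in cells}

composition_fails = (indep(xor, [0], [1]) and indep(xor, [0], [2])
                     and not indep(xor, [0], [1, 2]))
intersection_fails = (indep(copies, [0], [1], [2]) and indep(copies, [0], [2], [1])
                      and not indep(copies, [0], [1, 2]))
print(composition_fails, intersection_fails)  # True True
```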

The standard graph theoretical separation criterion has different consequences for the
two types of undirected graph corresponding for Gaussian distributions to concentration
and to covariance matrices. We say a path intersects subset $c$ of node set $N$
if it has an inner node in $c$, and let $\{a, b, c, m\}$ partition $N$ to formulate known Markov
properties. The notation is to remind one that with any independence statement $a \perp\!\!\!\perp b \mid c$,
one implicitly has marginalised over the remaining nodes in $m = N \setminus (a \cup b \cup c)$, i.e. one
considers the marginal joint distribution of $Y_a, Y_b, Y_c$.



Proposition 1. Lauritzen (1996). A concentration graph, $G^N_{\mathrm{con}}$, implies $a \perp\!\!\!\perp b \mid c$ if and
only if every path from $a$ to $b$ intersects $c$.

Proposition 2. Kauermann (1996). A covariance graph, $G^N_{\mathrm{cov}}$, implies $a \perp\!\!\!\perp b \mid c$ if and
only if every path from $a$ to $b$ intersects $m$.
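
The two separation criteria differ only in which nodes block a path. A minimal sketch, assuming an undirected graph stored as a dictionary of neighbour sets (the function names are hypothetical), is:

```python
from collections import deque

def reachable(adj, a, b, blocked):
    """True if some path from a node in a to a node in b has no inner node in blocked."""
    seen, queue = set(a), deque(a)
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb in b:
                return True
            if nb not in seen and nb not in blocked:
                seen.add(nb)
                queue.append(nb)
    return False

def concentration_graph_implies(adj, a, b, c):
    """Criterion of Proposition 1: every path from a to b has an inner node in c."""
    return not reachable(adj, a, b, c)

def covariance_graph_implies(adj, a, b, c):
    """Criterion of Proposition 2: every path from a to b has an inner node in m."""
    m = set(adj) - set(a) - set(b) - set(c)
    return not reachable(adj, a, b, m)

# The V with endpoints 1, 2 and inner node 3.
v_graph = {1: {3}, 2: {3}, 3: {1, 2}}
print(concentration_graph_implies(v_graph, {1}, {2}, {3}))    # True
print(concentration_graph_implies(v_graph, {1}, {2}, set()))  # False
print(covariance_graph_implies(v_graph, {1}, {2}, set()))     # True
print(covariance_graph_implies(v_graph, {1}, {2}, {3}))       # False
```

For the same V, the concentration graph reading gives independence of the endpoints given the inner node, while the covariance graph reading gives their marginal independence.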

Notice that Proposition 1 requires the intersection property, otherwise one could not
conclude for three distinct nodes $h, i, k$ e.g. that ($h \perp\!\!\!\perp i \mid k$ and $h \perp\!\!\!\perp k \mid i$) implies $h \perp\!\!\!\perp ik$,
while Proposition 2 requires the composition property, otherwise one could not conclude e.g.
that ($h \perp\!\!\!\perp i$ and $h \perp\!\!\!\perp k$) implies $h \perp\!\!\!\perp ik$.

Corollary 1. A covariance graph or a concentration graph implies $a \perp\!\!\!\perp b$
if and only if in the subgraph induced by $a \cup b$, there is no edge between $a$ and $b$.

Corollary 2. A regression graph, $G^N_{\mathrm{reg}}$, captures an independence structure for a
distribution with density $f_N$ factorizing as [...] if the composition and intersection property
hold for $f_N$, in addition to the standard properties of each density.

Proof. Given the intersection property (vi), any node $i$ with missing edges to nodes
$k, l$ in a concentration graph of node set $N$ implies $i \perp\!\!\!\perp \{k, l\} \mid N \setminus \{i, k, l\}$ and given the
composition property (v), any node $i$ with missing edges to nodes $k, l$ in a covariance
graph given $Y_c$ implies $i \perp\!\!\!\perp \{k, l\} \mid c$. □

For purely discrete and for Gaussian distributions, necessary and sufficient conditions
for the intersection property (vi) to hold are known; see San Martin, Mouchart and Rolin
(2005). Too strong sufficient conditions are for joint Gaussian distributions that they
are regular and for discrete variables, that the probabilities are strictly positive.

The composition property (v) is satisfied in Gaussian distributions and for triangular
binary distributions with at most main effects in symmetric (−1, 1) variables; see
Wermuth, Marchetti and Cox (2009).

Both properties (v) and (vi) hold whenever a distribution may have been generated
over a possibly larger parent graph; see Wermuth (2011), Marchetti and Wermuth
(2009), Wermuth, Wiedenbeck and Cox (2006). Parent graphs are directed acyclic
graphs that do not only capture an independence structure but are also a dependence
base with a unique independence statement assigned to each V of the graph. A
distribution generated over a parent graph mimics these properties of the parent
graph.

It is known that every regression graph can be generated by a larger directed acyclic
graph but not necessarily every statistical regression graph model can be generated in
this way; see Richardson and Spirtes (2002), Sections 6 and 8.6.

One needs similar properties for distributions generated over a regression graph. A
graph is edge-minimal for the generated distribution if the distribution has a
pairwise independence for each edge missing and a non-vanishing dependence for each
edge present in the graph. For the generated distribution to have a unique independence
statement assigned to each missing edge, it has to be singleton transitive, that is, for
$h, i, k, l$ distinct nodes of $N$,

\[ (i \perp\!\!\!\perp k \mid l \text{ and } i \perp\!\!\!\perp k \mid lh) \implies (i \perp\!\!\!\perp h \mid l \text{ or } k \perp\!\!\!\perp h \mid l). \]

This says that, in order to have both a conditional independence of $Y_i, Y_k$ given $Y_l$ and
given $Y_l, Y_h$, there has to be at least one additional independence involving the variable
$Y_h$, the additional variable in the conditioning set. For graphs representing a dependence
structure, this can be expressed equivalently as

\[ (i \pitchfork h \mid l \text{ and } k \pitchfork h \mid l \text{ and } i \perp\!\!\!\perp k \mid l) \implies i \pitchfork k \mid \{l, h\} \]

and

\[ (i \pitchfork h \mid l \text{ and } k \pitchfork h \mid l \text{ and } i \perp\!\!\!\perp k \mid \{l, h\}) \implies i \pitchfork k \mid l, \]

which says that in the distribution there is a unique independence statement that
corresponds to each V in the graph. For a 2 × 2 × 3 contingency table, an example violating
singleton-transitivity has been given with equation (5.4) by Birch (1963).



There exist these peculiar types of incomplete families of distributions; see Lehmann
and Scheffé (1955), Brown (1986), Mandelbaum and Rüschendorf (1987), in which
independence statements connected with a V may have the inner node both within and outside the
conditioning set; see Wermuth and Cox (2004), Section 7, Darroch (1962). Such
independences have also been characterized as being not representable in joint Gaussian
distributions; see Lněnička and Matúš (2007). These distributions and those that are
faithful to graphs are of limited interest in applications in which one wants to interpret
sequences of regressions.

A distribution is said to be faithful to a graph if every one of its independence
constraints is captured by a given independence graph; see Spirtes, Glymour and Scheines
(1993). As is proven in a forthcoming paper, this requires for regression graphs that (1)
the graph represents both an independence and a dependence structure, and that (2)
the distribution satisfies the composition and the intersection property and is weakly
transitive, a property that is the following extension of singleton transitivity for node
$h$ replaced by a subset $d$ of $N \setminus \{i, k, l\}$ that may contain several nodes:

\[ (i \perp\!\!\!\perp k \mid l \text{ and } i \perp\!\!\!\perp k \mid \{l, d\}) \implies (i \perp\!\!\!\perp d \mid l \text{ or } k \perp\!\!\!\perp d \mid l). \]



This faithfulness property imposes strange constraints on parameters whenever more
than two nodes induce a complete subgraph in the graph; see for instance Figure 1 in
Wermuth, Marchetti and Cox (2009) for three binary variables. An early example
of a regular Gaussian distribution that does not satisfy weak transitivity is due to
Cox and Wermuth (1993), equation (8).

Notice that in general, the extension of singleton transitivity to weak transitivity
excludes parametric cancelations that result from several paths connecting the same
node pair. This is the only type of a possible parametric cancelation in regular Gaussian
distributions; see Wermuth and Cox (1998).

However, the constraints are mild for distributions corresponding to regression graphs 
that form a dependence base and that are forests. Forests are the union of disjoint 
trees and a tree is a connected undirected graph with one unique path joining every 
node pair. 
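
Whether a given undirected graph is a forest in this sense can be checked by looking for a second path between any node pair, that is, for a cycle. The following sketch uses a union-find structure for this purpose; it assumes a simple graph given as a node list and an edge list, and the function name `is_forest` is a hypothetical choice.

```python
def is_forest(nodes, edges):
    """True if the undirected simple graph (nodes, edges) has no cycle."""
    parent = {v: v for v in nodes}

    def find(v):
        # Root of the tree containing v, with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, w in edges:
        root_u, root_w = find(u), find(w)
        if root_u == root_w:
            return False  # u and w are already joined by a path
        parent[root_u] = root_w
    return True

print(is_forest([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)]))  # Markov chain: True
print(is_forest([1, 2, 3], [(1, 2), (2, 3), (1, 3)]))     # triangle: False
```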

Lemma 1. A positive distribution is faithful to a forest representing both an independence
and a dependence structure if it is singleton transitive.

Proof. Positive distributions satisfy the intersection property and for concentration 
graphs, the composition property is irrelevant. Given the above characterizations of 
faithfulness and of weak transitivity, there are in a forest no cancelations due to several 
paths connecting the same node pair. Hence, weak transitivity will be violated only if 
the singleton transitivity fails. □ 

Corollary 3. A regular Gaussian distribution is faithful to a forest representing both 
an independence and a dependence structure. 

Notice that forests include trees and Markov chains as special cases. If they form 
dependence bases they are Markov equivalent to very special types of parent graphs but 
they are rarely of interest in statistics when studying sequences of regressions. 



6 Some early results on graphs and Markov equivalence 

In the past, results concerning graphs and Markov equivalence have been obtained quite 
independently in the mathematical literature on characterizing different types of graph, 
in the statistical literature on specifying types of multivariate statistical models, and in 
the computer science literature on deciding on special properties of a given graph or on 
designing fast algorithms for transforming graphs. 

For instance, following the simple enumeration result for labeled trees in $d$ nodes,
$d^{d-2}$, by Karl-Wilhelm Borchardt (1817-1880), it could be shown that these trees are
in one-to-one correspondence to distinct strings of size $d - 2$; see Cayley (1889). Much
later, labeled trees were recognized to form the subclass of directed acyclic graphs with
exclusively source Vs and therefore to be also Markov equivalent to chordal concentration
graphs that are without chordless paths in four nodes; see Castelo and Siebes (2003).
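
One well-known correspondence between labeled trees in $d$ nodes and strings of size $d - 2$ is the Prüfer code. The sketch below assumes that encoding, which is not necessarily the construction used in the cited work, and the function names are hypothetical. A tree on nodes $1, \ldots, d$ is encoded by repeatedly removing the leaf with the smallest label and recording its neighbour.

```python
import heapq
from itertools import product

def prufer_encode(edges, d):
    """Encode a labeled tree on nodes 1..d (edge list) as a list of length d - 2."""
    adj = {v: set() for v in range(1, d + 1)}
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    leaves = [v for v in adj if len(adj[v]) == 1]
    heapq.heapify(leaves)
    code = []
    for _ in range(d - 2):
        leaf = heapq.heappop(leaves)
        (neighbour,) = adj[leaf]          # a leaf has exactly one neighbour
        code.append(neighbour)
        adj[neighbour].discard(leaf)
        if len(adj[neighbour]) == 1:      # the neighbour has become a leaf
            heapq.heappush(leaves, neighbour)
    return code

def prufer_decode(code, d):
    """Rebuild the edge list of the labeled tree on nodes 1..d from its code."""
    degree = {v: 1 for v in range(1, d + 1)}
    for v in code:
        degree[v] += 1
    leaves = [v for v in degree if degree[v] == 1]
    heapq.heapify(leaves)
    edges = []
    for v in code:
        edges.append((heapq.heappop(leaves), v))
        degree[v] -= 1
        if degree[v] == 1:
            heapq.heappush(leaves, v)
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges

tree = [(1, 4), (2, 4), (3, 4), (4, 5)]
print(prufer_encode(tree, 5))             # [4, 4, 4]
print(prufer_decode([4, 4, 4], 5))        # [(1, 4), (2, 4), (3, 4), (4, 5)]

# Every string of size 2 over {1, 2, 3, 4} yields a different labeled tree in 4 nodes.
trees = {frozenset(frozenset(e) for e in prufer_decode(code, 4))
         for code in product(range(1, 5), repeat=2)}
print(len(trees))                         # 16
```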

In the literature on graphical Markov models, a number of different names have been
in use for a sink V, for instance 'two arrows meeting head-on' by Pearl (1988), 'unshielded
collider' by Richardson and Spirtes (2002), and 'Wermuth-configuration' by Whittaker
(1990), after it had been recognized that, for Gaussian distributions, the parameters of
a directed acyclic graph model without sink Vs are in one-to-one correspondence to the
parameters in its skeleton concentration graph model.

Proposition 3. (Wermuth, 1980), (Wermuth and Lauritzen, 1983), (Frydenberg, 1990).
A directed acyclic graph is Markov equivalent to a concentration graph of the same
skeleton if and only if it has no collision V.

Efficient algorithms to decide whether an undirected graph can be oriented into 
a directed acyclic graph became available in the computer science literature under 
the name of perfect elimination schemes; see Tarjan and Yannakakis (1984). When 
algorithms were designed later to decide which arrows may be flipped in a given 
$G^N_{\mathrm{dag}}$, keeping the same skeleton and the same set of sink Vs, to get to a 
list of all Markov equivalent $G^N_{\mathrm{dag}}$s, these early results by Tarjan and 
Yannakakis appear not to have been referred to directly; see Chickering (1995). 



The number of equivalent characterizations of concentration graphs that have perfect 
elimination schemes has increased steadily, since they were introduced as rigid circuit 
graphs by Dirac (1961). These graphs are not only named 'chordal graphs', but also 
'triangulated graphs', 'graphs with the running intersection property' or 'graphs with 
only complete prime graph separators'; see Cox and Wermuth (1999). 

By contrast, for a covariance graph that can be oriented to be Markov equivalent to 
a $G^N_{\mathrm{dag}}$ of the same skeleton, chordless paths are relevant. 



Proposition 4. (Pearl and Wermuth, 1994). A covariance graph with a chordless path 
in four nodes is not Markov equivalent to a directed acyclic graph in the same node set. 

For distributions generated over directed acyclic graphs, sink Vs are needed again. 

Proposition 5. (Frydenberg, 1990), (Verma and Pearl, 1990). Directed acyclic graphs 
of the same skeleton are Markov equivalent if and only if they have the same sink Vs. 

Markov equivalence of a concentration graph and a covariance graph model is for regular 
joint Gaussian distributions equivalent to parameter equivalence, which means 
that there is a one-to-one relation between the two sets of parameters. Therefore, an early 
result on parameter equivalence for joint Gaussian distributions implies the following 
Markov equivalence result for distributions satisfying both the composition and the 
intersection property. 

Proposition 6. (Jensen, 1988), (Drton and Richardson, 2008b). A covariance graph 
is Markov equivalent to a concentration graph if and only if both consist of the same 
complete, disconnected subgraphs. 
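
The condition of Proposition 6 is easy to check mechanically. The following is a minimal sketch, assuming both graphs are given as lists of node pairs; the function names are illustrative and not taken from the paper.

```python
from itertools import combinations


def connected_components(nodes, edges):
    """Connected components of an undirected graph, each as a set of nodes."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def cov_conc_markov_equivalent(nodes, cov_edges, conc_edges):
    """True if both graphs consist of the same complete, disconnected subgraphs."""
    cov = {frozenset(e) for e in cov_edges}
    conc = {frozenset(e) for e in conc_edges}
    if cov != conc:                      # the two skeletons have to coincide
        return False
    return all(
        frozenset(pair) in cov           # every component has to be complete
        for comp in connected_components(nodes, cov_edges)
        for pair in combinations(comp, 2)
    )
```

For instance, two disjoint edges in four nodes satisfy the condition, while a three-node path does not, since its endpoints are uncoupled within one connected component.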

Fast ways of inserting an edge for every transition V, of deciding on connectivity 
and on blocking flows have been available in the corresponding Russian literature since 
1970; see Dinitz (2006), but these results appear not to have been exploited for 
the so-called lattice conditional independence models, recognized as distributions 
generated over $G^N_{\mathrm{dag}}$s without any transition Vs by Andersson, Madigan, 
Perlman and Triggs (1997). 

Results on Markov equivalence of other than multivariate regression chain graphs have been 
given by Roverato (2005), Andersson and Perlman (2006) and Roverato and Studený 
(2006). 

With the so-called global Markov property of a graph in node set N and any disjoint 
subsets a, b, c of N, one can decide whether the graph implies $a \perp\!\!\!\perp b \mid c$. 
To give this property for a regression graph, we use special types of path that have been 
called active; see Wermuth (2011). For this, let again {a, b, c, m} partition the node set 
of $G^N_{\mathrm{reg}}$. 

Definition 1. A path from a to b in $G^N_{\mathrm{reg}}$ is active given c if its inner 
collision nodes are in c or have a descendant in c and its inner transmitting nodes are in 
m = N \ (a ∪ b ∪ c). Otherwise, the path is said to break given c or, equivalently, to 
break with m. 



Thus, a path breaks when c includes an inner transmitting node or when m includes 
an inner collision node and all its descendants; see also Figure 4 of Marchetti and Wermuth. 

For directed acyclic graphs, an active path of Definition 1 reduces to the d-connecting 
path of Geiger, Verma and Pearl (1990). Similarly, the following proposition coincides 
in that special case with their so-called d-separation. Let the node set of 
$G^N_{\mathrm{reg}}$ be partitioned as above by {a, b, c, m}. 



Proposition 7. (Cox and Wermuth, 1996), (Sadeghi, 2009). A regression graph, 
$G^N_{\mathrm{reg}}$, implies $a \perp\!\!\!\perp b \mid c$ if and only if every path 
between a and b breaks given c. 

Thus, whenever $G^N_{\mathrm{reg}}$ implies $a \perp\!\!\!\perp b \mid c$, this independence 
statement holds in the corresponding sequence of regressions for which the density $f_N$ 
factorizes as (1), provided that $f_N$ satisfies the same properties of independences, 
(i) to (vi) of Section 5, just like a regular Gaussian joint density. For example, in the 
graphs of Figure 12, node 2 is an ancestor of node 1 so that $G^N_{\mathrm{reg}}$ does not 
imply $3 \perp\!\!\!\perp 4 \mid 1$. 
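
Definition 1 and Proposition 7 translate into a brute-force check over all paths. This is a sketch for small graphs only, with assumptions that are not spelled out in this section: edges are triples (u, v, kind), where 'a' is an arrow from u to v, 'd' a dashed line and 'f' a full line; an inner node is taken to be a collision node when both path edges end at it with an arrowhead or a dashed line; descendants are nodes reached along arrows. The function name is hypothetical.

```python
def implies_independence(edges, a, b, c):
    """True if every path between a and b breaks given c (Proposition 7)."""
    adj = {}
    for u, v, kind in edges:
        adj.setdefault(u, {})
        adj.setdefault(v, {})
        if kind == 'a':                       # arrow u -> v
            adj[u][v], adj[v][u] = 'tail', 'head'
        elif kind == 'd':                     # dashed line
            adj[u][v] = adj[v][u] = 'dash'
        else:                                 # full line
            adj[u][v] = adj[v][u] = 'line'
    a, b, c = set(a), set(b), set(c)
    m = set(adj) - a - b - c

    def descendants(x):
        found, stack = set(), [x]
        while stack:
            v = stack.pop()
            for w, mark in adj[v].items():
                if mark == 'tail' and w not in found:
                    found.add(w)
                    stack.append(w)
        return found

    def inner_node_open(p, x, q):
        collision = adj[x][p] in ('head', 'dash') and adj[x][q] in ('head', 'dash')
        if collision:
            return x in c or bool(descendants(x) & c)
        return x in m                         # transmitting node

    def active_path_exists(path, target):
        last = path[-1]
        for nxt in adj.get(last, {}):
            if nxt in path:
                continue
            if len(path) >= 2 and not inner_node_open(path[-2], last, nxt):
                continue
            if nxt == target or active_path_exists(path + [nxt], target):
                return True
        return False

    return not any(active_path_exists([i], k) for i in a for k in b)


# hypothetical graph: 3 -> 2 <- 4 and 2 -> 1
example = [(3, 2, 'a'), (4, 2, 'a'), (2, 1, 'a')]
```

In the hypothetical example graph, node 2 is an inner collision node with descendant 1, so that the graph implies the marginal independence of 3 and 4, but no longer their independence given node 1.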



Figure 12: Three regression graphs, which imply $3 \perp\!\!\!\perp 4$ but not $3 \perp\!\!\!\perp 4 \mid 1$. 



Since covariance and concentration graphs consist only of one type of edge, the 
restricted versions in Propositions 1 and 2 of the defined path can be used for their 
global Markov property. 

7 The main new results and proofs 

We now treat connected regression graphs in node set N and corresponding distributions 
defined by sequences of regressions with joint discrete or continuous responses, ordered 
in connected components $g_1, \ldots, g_r$ of the graph, and with context variables in connected 
components $g_{r+1}, \ldots, g_J$, which factorize as in (1), satisfy the pairwise independences 
of (2) as well as properties of independence statements, given as (i) to (vi) in Section 5. 

For the main result of Markov equivalence for regression graphs, we consider distinct 
nodes i and k, node subsets c of N \ {i, k} and the notion of shortest active paths. 

Definition 2. An ik-path in $G^N_{\mathrm{reg}}$ is a shortest active path $\pi$ with respect 
to c if every ik-path of $G^N_{\mathrm{reg}}$ with fewer inner nodes breaks given c. 

Every chordless $\pi$ is such a shortest path. If the consecutive nodes $(k_{n-1}, k_n, k_{n+1})$ 
on $\pi = (i = k_0, k_1, \ldots, k_m = k)$ induce a complete subgraph in $G^N_{\mathrm{reg}}$, 
we say that there is a triangle on the path. In Figure 13a) nodes 2, 3, 4 form a triangle on 
the path (1, 2, 4, 3, 5). 

Figure 13: Graphs of active five-node paths a) with path (1, 2, 4, 3, 5) the shortest active path, 
where 3 is in c, b) active path (4, 2, 1, 3, 5), where 1 is in c, and a shorter active path (4, 2, 3, 5). 

If this path is an active path connecting the uncoupled node pair (1, 5), then nodes 2 
and 4 are inner transmitting nodes outside c and the inner collision node 3 is in c. This 
path is then also the shortest active path connecting (1, 5). The shorter path (1, 2, 3, 5) 
has nodes 2 and 3 as inner transmitting nodes, but is inactive since node 3 is in c. 

By contrast in Figure 13b), when path (4, 2, 1, 3, 5) is an active path connecting the 
uncoupled node pair (4, 5), then path (4, 2, 3, 5) is a shorter active path. To see this, 
notice that on an active (4, 2, 1, 3, 5) path, the inner collision node 1 is in c and the 
inner transmitting nodes 2 and 3 are outside c. In this case, the inner collision node 2 
on the path (4, 2, 3, 5) has node 1 as a descendant in c, so that this shorter path is also 
active. 

We also use the following results for proving Theorem 1. The first two are direct 
consequences of Proposition 7 and imply the pairwise independences of equation (2). 
Lemma 4 results with the independence form of (2). Let h, i, k be distinct nodes of N. 

Lemma 2. For (h, i, k) a collision V in $G^N_{\mathrm{reg}}$, the inner node i is excluded from c in 
every independence statement for h, k implied by $G^N_{\mathrm{reg}}$. 

Lemma 3. For (h, i, k) a transmitting V in $G^N_{\mathrm{reg}}$, the inner node i is included in c in 
every independence statement for h, k implied by $G^N_{\mathrm{reg}}$. 

Lemma 4. A missing ik-edge in $G^N_{\mathrm{reg}}$ implies at least one independence statement 
$i \perp\!\!\!\perp k \mid c$ for c a subset of N \ {i, k}. 

We can now derive the first of the main new results in this paper. 

Theorem 1. Two regression graphs are Markov equivalent if and only if they have the 
same skeleton and the same sets of collision Vs, irrespective of the type of edge. 

Proof. Regression graphs $G^N_{\mathrm{reg}1}$ and $G^N_{\mathrm{reg}2}$ are Markov equivalent if and 
only if for every disjoint subsets a, b, and c of the node set N, where only c can be empty, 

$$(G^N_{\mathrm{reg}1} \Longrightarrow a \perp\!\!\!\perp b \mid c) \iff (G^N_{\mathrm{reg}2} \Longrightarrow a \perp\!\!\!\perp b \mid c). \qquad (11)$$

Suppose first that (11) holds. By Lemma 4, $G^N_{\mathrm{reg}1}$ and $G^N_{\mathrm{reg}2}$ have the same 
skeleton, and by Lemma 2 and Lemma 3, $G^N_{\mathrm{reg}1}$ and $G^N_{\mathrm{reg}2}$ have the same 
collision Vs. 

Suppose next that $G^N_{\mathrm{reg}1}$ and $G^N_{\mathrm{reg}2}$ have the same skeleton and the same 
collision Vs and consider two arbitrary distinct nodes i and k and any node subset c of N \ {i, k}. 
By Proposition 7, (11) is equivalent to stating that for every uncoupled node pair i, k, 
there is an active ik-path with respect to c in $G^N_{\mathrm{reg}1}$ if and only if there is an 
active ik-path with respect to c in $G^N_{\mathrm{reg}2}$. 

Suppose further that path $\pi$ is in $G^N_{\mathrm{reg}1}$ a shortest active ik-path with respect to c. 
Since $G^N_{\mathrm{reg}1}$ and $G^N_{\mathrm{reg}2}$ have the same skeleton, the path $\pi$ exists in 
$G^N_{\mathrm{reg}2}$. We need to show that it is active. If all consecutive two-edge-subpaths of 
$\pi$ are Vs then $\pi$ is active in $G^N_{\mathrm{reg}2}$. Therefore, suppose that nodes 
$(k_{n-1}, k_n, k_{n+1})$ on $\pi$ form a triangle instead of a V. It may be checked first that, 
for all possible triangles in regression graphs that can appear on $\pi$ other than the two of 
Figure 14, there is as in Figure 13b) a shorter active path. To complete the proof, we show 
that for the two types of triangles shown in Figure 14a) and Figure 14b), path $\pi$ is also 
in $G^N_{\mathrm{reg}2}$ an active ik-path with respect to c. 



Figure 14: The two types of triangles in regression graphs without a shorter active path 
whenever the path with inner nodes $(k_n, k_{n-1})$ is active. 

In $G^N_{\mathrm{reg}1}$ containing the triangle of Figure 14a) on a shortest active path $\pi$, 
node $k_n$ is a transmitting node, which is by Lemma 3 outside c. By Lemma 2, node $k_{n-1}$ is a 
collision node inside c. If instead $k_{n-1}$ were a transmitting node on $\pi$ in 
$G^N_{\mathrm{reg}1}$, it would also be a transmitting node on $(k_{n-2}, k_{n-1}, k_{n+1})$ and 
give a shorter active path via the $k_{n-1}k_{n+1}$-edge, contradicting the assumption of $\pi$ 
being a shortest path. Similarly, if collision node $k_{n-1}$ on $\pi$ were only an ancestor of c, 
then there would be a shorter active path via the $k_{n-1}k_{n+1}$-edge. 

In addition, node pair $k_n, k_{n-2}$ is uncoupled in $G^N_{\mathrm{reg}1}$ since by inserting any 
such edge that is permissible in a regression graph, another shortest path via the 
$k_{n-2}k_n$-edge would result. Therefore, since $G^N_{\mathrm{reg}1}$ and $G^N_{\mathrm{reg}2}$ have 
the same collision Vs, the subpath $(k_{n-2}, k_{n-1}, k_n)$ forms also a collision V in 
$G^N_{\mathrm{reg}2}$. Similarly, $(k_{n-2}, k_{n-1}, k_{n+1})$ is a transmitting V and 
$(k_{n+2}, k_{n+1}, k_n)$ is a V of either type. Hence $k_{n-1}$ is a parent of $k_{n+1}$ in 
$G^N_{\mathrm{reg}2}$ and the only permissible edge between $k_n$ and $k_{n+1}$ is an arrow 
pointing to $k_{n+1}$. Therefore, $\pi$ forms an active path also in $G^N_{\mathrm{reg}2}$. 

The proof for Figure 14b) is the same as for Figure 14a) since the types of nodes 
along $\pi$, i.e. as collision or transmitting nodes, are unchanged. □ 

In the example of Figure [15], all three regression graphs have the same skeleton. 




Figure 15: a) Regression graph $G^N_{\mathrm{reg}1}$, b) a Markov equivalent regression graph 
$G^N_{\mathrm{reg}2}$ to $G^N_{\mathrm{reg}1}$, c) a regression graph $G^N_{\mathrm{reg}3}$ that is 
directed acyclic and not Markov equivalent to $G^N_{\mathrm{reg}1}$. 

In $G^N_{\mathrm{reg}1}$ there are three collision Vs (3, 4, 5), (1, 2, 5), and (2, 1, 3). In 
$G^N_{\mathrm{reg}2}$ there are the same collision Vs. Therefore, these two graphs are Markov 
equivalent. However, there are only two collision Vs in $G^N_{\mathrm{reg}3}$; these are (3, 4, 5) 
and (2, 1, 3). Hence this graph is not Markov equivalent to $G^N_{\mathrm{reg}1}$ and 
$G^N_{\mathrm{reg}2}$. The Markov equivalences of the graphs in Figure 2 to the subgraph induced 
by {b, c} in Figure 1 are further applications of Theorem 1. Notice that Propositions 3 to 8 
of Section 6 result as special cases of Theorem 1. 
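
The criterion of Theorem 1 can be sketched in a few lines. The edge list `g1` below is the sequence of triples that the proof of Corollary 4 gives for the graph of Figure 15a), with 'd' a dashed line and 'a' an arrow from the first to the second entry. The lists `g2` and `g3` are hypothetical graphs, chosen only to have the collision Vs that the text reports for Figure 15b) and 15c); they need not be the graphs drawn there. The code 'f' for a full line and all function names are assumptions.

```python
from itertools import combinations


def skeleton(edges):
    """Set of coupled node pairs, irrespective of the type of edge."""
    return {frozenset((u, v)) for u, v, _ in edges}


def collision_vs(edges):
    """All collision Vs (i, o, k) with inner node o and uncoupled i < k."""
    ends_in_collision = {}   # (x, y): edge xy meets x with arrowhead or dashed line
    for u, v, kind in edges:
        if kind == 'a':                      # arrow u -> v
            ends_in_collision[(u, v)] = False
            ends_in_collision[(v, u)] = True
        else:                                # 'd' dashed line, 'f' full line
            ends_in_collision[(u, v)] = ends_in_collision[(v, u)] = (kind == 'd')
    skel = skeleton(edges)
    neighbours = {}
    for x, y in ends_in_collision:
        neighbours.setdefault(x, set()).add(y)
    result = set()
    for o, ns in neighbours.items():
        for i, k in combinations(sorted(ns), 2):
            uncoupled = frozenset((i, k)) not in skel
            if uncoupled and ends_in_collision[(o, i)] and ends_in_collision[(o, k)]:
                result.add((i, o, k))
    return result


def markov_equivalent(edges1, edges2):
    """Theorem 1: same skeleton and same sets of collision Vs."""
    return (skeleton(edges1) == skeleton(edges2)
            and collision_vs(edges1) == collision_vs(edges2))


g1 = [(1, 2, 'd'), (3, 1, 'a'), (5, 2, 'a'), (4, 3, 'd'), (4, 5, 'd')]
g2 = [(1, 2, 'd'), (3, 1, 'a'), (5, 2, 'a'), (3, 4, 'a'), (5, 4, 'a')]  # hypothetical
g3 = [(2, 1, 'a'), (3, 1, 'a'), (5, 2, 'a'), (3, 4, 'a'), (5, 4, 'a')]  # hypothetical
```

With these inputs, `g1` and `g2` share the three collision Vs listed above, while the directed acyclic `g3` has lost the collision V (1, 2, 5).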

The following algorithm generates a directed acyclic graph from a given $G^N_{\mathrm{reg}}$ that 
fulfills its known necessary conditions for Markov equivalence to a directed acyclic graph; 
see Proposition 2 of Wermuth (2010). We refer to the connected components of 
$G^N_{\mathrm{reg}}$ as its blocks. 

Algorithm 1. (Obtaining a Markov equivalent directed acyclic graph from a regression 
graph). Start from any given $G^N_{\mathrm{reg}}$ that has a chordal concentration graph and no 
chordless collision path in four nodes. 

1. Apply the maximum cardinality search algorithm on the block consisting of full 
lines to order the nodes of the block. 

2. Orient the edges of the block from a higher number to a lower one. 

3. Replace collision Vs by sink Vs, i.e. replace i - - - o - - - k and i - - - o <--- k by 
i ---> o <--- k when i and k are uncoupled. When a dashed line in a block is replaced 
by an arrow, label the endpoints such that the arrow is from a higher number to a 
lower one if the labels do not already exist. 

4. Replace dashed lines i - - - o - - - k of triangles by a sink path i ---> o <--- k. When 
a dashed line in a block is replaced by an arrow, label the endpoints such that the 
arrow is from a higher number to a lower one if the labels do not already exist. 

5. Replace dashed lines by arrows from a higher number to a lower one. 

Continually apply each step until it is not possible to continue applying it further. Then 
move to the next step. 
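
Steps 1 and 2 of Algorithm 1 can be sketched as follows. This is a simple quadratic-time version of maximum cardinality search, not the linear-time implementation of Tarjan and Yannakakis (1984). It assumes their numbering convention, in which the first node visited receives the highest number, so that orienting every edge from the higher to the lower number gives each node a complete set of parents when the block is chordal. The function names are illustrative.

```python
def maximum_cardinality_search(nodes, edges):
    """Number the nodes n, n-1, ..., 1, always choosing next an unnumbered
    node with the largest count of already numbered neighbours."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    number = {}
    weight = {v: 0 for v in nodes}
    for label in range(len(nodes), 0, -1):
        v = max((u for u in nodes if u not in number), key=lambda u: weight[u])
        number[v] = label
        for w in adj[v]:
            if w not in number:
                weight[w] += 1
    return number


def orient_high_to_low(edges, number):
    """Turn every full line into an arrow (parent, child), higher number first."""
    return [(u, v) if number[u] > number[v] else (v, u) for u, v in edges]


# hypothetical chordal block: a triangle 1, 2, 3 with a pendant node 4
block_nodes = [1, 2, 3, 4]
block_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
numbering = maximum_cardinality_search(block_nodes, block_edges)
arrows = orient_high_to_low(block_edges, numbering)
```

Because all arrows point from higher to lower numbers, no directed cycle can arise, and for a chordal block no sink V with uncoupled endpoints is created.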

Lemma 5. For a regression graph $G^N_{\mathrm{reg}}$ with a chordal concentration graph and without 
chordless collision paths in four nodes, Algorithm 1 generates a directed acyclic graph that is 
Markov equivalent to $G^N_{\mathrm{reg}}$. 

Proof. The generated graph is directed since by Algorithm 1, all edges are turned into 
arrows. Since the block containing full lines is chordal, the graph generated by the 
perfect elimination order of the maximum cardinality search does not have a directed 
cycle; see Blair and Peyton (1993), Section 2.4, and Tarjan and Yannakakis (1984). 

In addition, the arrows present in the graph do not change by the algorithm. Thus, 
to generate a cycle containing an arrow of the original graph, there should have been 
a cycle in the directed graph generated by replacing blocks by nodes. But, this is 
impossible in a regression graph. Therefore in the generated graph, there is no cycle 
containing arrows that have been between the blocks of the original graph. 

Within a block, all arrows point from nodes with higher numbers to nodes with lower 
ones. Otherwise, there would have been at step 3 of the algorithm a chordless collision 
path with four nodes in the graph. Hence no directed cycle can be generated. 

Theorem 1 gives Markov equivalence to $G^N_{\mathrm{reg}}$ since Algorithm 1 preserves the skeleton 
of $G^N_{\mathrm{reg}}$ and no additional collision V is generated because sink oriented Vs remain, only 
dashed lines are turned into arrows and no arrows are changed to dashed lines. □ 

Notice that this algorithm does not generate a unique directed acyclic graph, but 
every generated directed acyclic graph is Markov equivalent to the given regression 
graph. To obtain the overall complexity of Algorithm 1, we denote by n the number of 
nodes in the graph and by e the number of edges in the graph. 

Corollary 4. The overall complexity of Algorithm 1 is $O(e^3)$. 

Proof. Suppose that the input of Algorithm 1 is a sequence of triples, each of which 
consists of the two endpoints of an edge and of the type of edge. The length of this 
sequence is equal to e and the highest number appearing in the sequence is n. For example, 
the sequence to the graph of Figure 15a) is ((1, 2, d), (3, 1, a), (5, 2, a), (4, 3, d), (4, 5, d)), 
where 'd' corresponds to a dashed line and 'a' corresponds to an arrow pointing from 
the first entry to the second one. Notice that this labeling is in general not the same as 
the ordering of nodes given by Algorithm 1. 

The first two steps of Algorithm 1 can be performed in O(e + n) time; see Blair and Peyton 
(1993). Step 3 of Algorithm 1 may be performed in e(e + 1)(e - 2)/2 steps since for 
each edge, one can go through the edge set to find the edges that give a three node path 
with an inner collision node. This needs e(e + 1)/2 steps. For each collision node, one 
goes again through the edge set, excluding the two edges involved in the collision path, 
to check if the collision is a V. Other actions can be done in constant time. 

Step 4 may require ne(e + 1)/2 steps since the paths considered are of the form 
o - - - o - - - o and do not form a V. Therefore, there is no reason to go through the edge 
set for the third time, but one might need to go through the node ordering to decide on the 
direction of the generated arrow. The last step may be performed with ne steps by going 
through the edge set changing 'd's to 'a's appropriately by looking at the node ordering. 
Therefore, the overall complexity of Algorithm 1 is $O(e^3)$. □ 
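
The step counts of the proof can be added up explicitly. The following is a sketch of that arithmetic, under the assumption that the graph is connected, as stated at the start of this section, so that n ≤ e + 1.

```latex
\underbrace{O(e+n)}_{\text{steps 1 and 2}}
+ \underbrace{\tfrac{1}{2}\,e(e+1)(e-2)}_{\text{step 3}}
+ \underbrace{\tfrac{1}{2}\,n\,e(e+1)}_{\text{step 4}}
+ \underbrace{n\,e}_{\text{step 5}}
\;\le\; O(e) + \tfrac{1}{2}\,e^{3} + \tfrac{1}{2}\,e(e+1)^{2} + e(e+1)
\;=\; O(e^{3}).
```

The cubic term comes from steps 3 and 4; the search and orientation steps contribute only lower-order terms.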

Corollary 2 and Propositions 4 to 8 can now be derived as special cases of Theorem 
1 and Lemma 4. In addition by using Lemma 1, Lemma 2 and pairwise independences, 
subclasses of regression graphs can be identified, which intersect with directed acyclic 
graphs, with other types of chain graphs, with concentration graphs or with covariance 
graphs. 

Theorem 2. A regression graph with a chordal graph for the context variables can be 
oriented to be Markov equivalent to a directed acyclic graph in the same skeleton, if and 
only if it does not contain any chordless collision path in four nodes. 

Proof. Every chordal concentration graph can be oriented to be equivalent to a directed 
acyclic graph; see Tarjan and Yannakakis (1984). A missing edge for node pair i < k 
in a directed acyclic graph means $i \perp\!\!\!\perp k \mid {>}i \setminus \{k\}$, which would 
contradict the pairwise independences of (2) if the graph contained a semi-directed chordless 
collision path in four nodes. No undirected chordless collision path in four nodes can be fully 
oriented without changing a collision V into a transmitting V, but $G^N_{\mathrm{reg}}$ can be 
oriented using Algorithm 1 if it contains no such path. □ 

Notice that for joint Gaussian distributions, Theorem 2 excludes Zellner's seemingly 
unrelated regressions and it excludes covariance graphs that cannot be made Markov 
equivalent to fully directed acyclic graphs; see Proposition 4. 

Proposition 8. A multivariate regression graph with connected components $g_1, \ldots, g_J$ is 
an AMP chain graph in the same connected components if and only if the covariance 
graph of every connected component of responses is complete. 

Proof. The conditional relations of the joint response nodes in an AMP chain graph 
coincide with those of the regression graph with the same connected components. 
Furthermore, the subgraph induced by each connected component $g_j$ of an AMP chain 
graph is a concentration graph given $g_{>j}$ while in $G^N_{\mathrm{reg}}$ it is a covariance graph 
given $g_{>j}$. By Proposition 6, these have to be complete for Markov equivalence. □ 

Proposition 9. A multivariate regression graph with connected components $g_1, \ldots, g_J$ 
is a LWF chain graph in the same connected components if and only if it contains no 
semi-directed chordless collision path in four nodes and the covariance graph of every 
connected component of responses is complete. 

Proof. The proof for the connected components of a LWF chain graph is the same as 
for an AMP chain graph since they both have concentration graphs for $g_j$ given $g_{>j}$. 
The dependences of joint responses $g_j$ on $g_{>j}$ coincide in a LWF chain graph with 
the bipartite part of the concentration graph in $g_j \cup g_{>j}$ so that Markov equivalent 
independence statements can only hold with these bipartite graphs being complete. □ 

Figure [12] illustrates Propositions |2] to IH] with modified graphs of Figure HI 






Figure 16: The graph of Figure H] modified by adding edges to obtain a graph that is Markov 
equivalent to a) a directed acyclic graph b) an AMP chain graph in the same connected 
components c) a LWF chain graph in the same connected components. 

The graphs in Figure 16 are Markov equivalent to a) a directed acyclic graph with 
the same skeleton obtainable by Algorithm 1, b) an AMP chain graph in the same 
connected components and c) a LWF chain graph in the same connected components. 

In general, by inserting some edges, a regression graph model can be turned into 
a model in one of the intersecting classes used in Propositions [2] to [HI just as a non- 
chordal graph may be turned into chordal one by adding edges. When the independence 
structure of interest is captured by an edge-minimal regression graph, then the resulting 
graph after adding edges will no longer be an edge-minimal graph and hence will not 
give the most compact graphical description possible. 

However, the graph with some added edges may define a covering model that is 
easier to fit than the reduced model corresponding to the edge-minimal graph, just 
as an unconstrained Gaussian bivariate response regression on two regressors may be 
fitted in closed form, while the maximum-likelihood fitting in the reduced model of 
Zellner's seemingly unrelated regression requires iterative fitting algorithms. Any 
well-fitting covering model in the three intersecting classes will show weak dependences for 
the edges that are to be removed to obtain an edge-minimal graph. 

Notice that sequences of regressions in the intersecting class with LWF chain graphs 
correspond for Gaussian distributions to sequences of the general linear models of 
Anderson (1958), Chapter 8, that is to models in which each joint response has the 
same set of regressor variables. This shows in $G^N_{\mathrm{reg}}$ by identical sets of nodes from 
which arrows point to each node within a connected component. 

In contrast, the models in the intersecting classes with the two types of undirected 
graph may be quite complex in the sense of including many merely generated chordless 
cycles of size four or larger. 

Proposition 10. A multivariate regression graph has the skeleton concentration graph 
if and only if it contains no collision V and it has the skeleton covariance graph if and 
only if it contains no transmitting V. 

Proof. Every V is a collision V in a covariance graph and a transmitting V in a concen- 
tration graph; see Lemma 1 and Lemma 2. The first includes, the second excludes the 
inner node from the defining independence statement. Thus, in the presence of a V, one 
would contradict the uniqueness of the defining pairwise independences. □ 

Lastly, Figure 17 shows the overall concentration graph induced by $G^N_{\mathrm{reg}}$ of Figure 
m. It may be obtained from the given $G^N_{\mathrm{reg}}$ by finding first the smallest covering LWF 
chain graph in the same connected components, then closing every sink V by an edge, 
i.e. adding an edge between its endpoints, and finally changing all edges to full lines. 

In such a graph, several chordless cycles in four or more nodes may be induced 
and the connected components of G^g may no longer show. In such a case, much of 
the important structure of the generating regression graph is lost. In addition, merely 
induced chordless cycles require iterative algorithms for maximum-likelihood estimation, 
even for Gaussian distributions. Thus, in the case of connected joint responses, it may 
be unwise to use a model search within the class of concentration graph models. 

Figure 17: The overall concentration graph induced by the regression graph in Figure |H 

This contrasts with LWF chain graphs that coincide with regression graphs, such as 
in Figure [TBb) . These preserve the available prior knowledge about the connected com- 
ponents and give Markov equivalence to directed acyclic graphs so that model fitting is 
possible in terms of single response regressions, that is by using just univariate condi- 
tional densities. In addition, the simplified criteria for Markov equivalence of directed 
acyclic graphs apply. 

On the other hand, sequences of regressions that coincide with LWF chains, per- 
mit us to model simultaneous intervention on a set of variables since the corresponding 
independence graphs are directed and acyclic in nodes representing vector variables. 
This represents a conceptually much needed extension of distributions generated over 
directed acyclic graphs in nodes representing single variables, but excludes the more 
specialized seemingly unrelated regressions and incomplete covariance graphs. 

Appendix: Details of regressions for the chronic pain data 

The following tables show the results of linear least-squares regressions or logistic 
regressions, one at a time, for each of the response variables and for each component of 
a joint response separately. At first, each response is regressed on all its potentially 
explanatory variables given by their first ordering. The tables give the estimated constant 
term and for each variable in the regression, its estimated coefficient (coeff), the 
estimated standard deviation of the coefficient (s_coeff), as well as the ratio 
z_obs = coeff/s_coeff. These ratios are compared with 2.58, the 0.995 quantile of a random 
variable Z having a standard Gaussian distribution, for which Pr(|Z| > 2.58) = 0.01. In 
backward selection steps, the variable with the smallest observed value |z_obs| is deleted 
from a regression equation, one at a time, until the threshold is reached. 

Response: Y, success of treatment; linear regression including a quadratic term 

                                  starting model            selected           excluded 
explanatory variables          coeff  s_coeff  z_obs    coeff  s_coeff  z_obs    z'_obs 
constant                       23.40     -       -      20.50     -       -        - 
Z_a, pain intensity after      -1.73   0.15  -11.19     -1.89   0.15  -12.77 
X_a, depression after          -0.16   0.05   -3.04                              -1.86 
Z_b, pain intensity before      0.04   0.16    0.26                               0.65 
X_b, depression before          0.10   0.05    1.82                               0.33 
U, pain chronicity             -0.15   0.30   -0.51                              -0.99 
A, site of pain                -2.27   0.91   -2.48                              -2.33 
V, previous illnesses           0.19   0.11    1.76                               1.24 
B, level of schooling          -0.50   0.78   -0.64                              -0.22 
(Z_a - mean(Z_a))^2             0.18   0.23    3.41      0.23   0.05    4.28 

R^2_full = 0.54        Selected model Y : Z_a + Z_a^2        R^2_sel = 0.49 



Response: Z_a, intensity of pain after treatment; linear regression 

                                  starting model            selected           excluded 
explanatory variables          coeff  s_coeff  z_obs    coeff  s_coeff  z_obs    z'_obs 
constant                        2.74     -       -       2.98     -       - 
Z_b, pain intensity before      0.12   0.08    1.60      0.16   0.07    2.16* 
X_b, depression before          0.03   0.02    1.28                               1.76 
U, pain chronicity              0.11   0.14    0.75                               1.43 
A, site of pain                 1.07   0.42    2.51      1.27   0.39    3.26 
V, previous illnesses           0.00   0.05    0.03                               0.83 
B, level of schooling          -0.19   0.37   -0.52                              -0.70 

R^2_full = 0.09        Selected model Z_a : Z_b + A        R^2_sel = 0.07 

*: depression before treatment needed because of the repeated measurement design; 
the low correlation for Z_a, Z_b is due to a change in measuring, before and after treatment 



The procedure defines a selected model, unless one of the excluded variables has a 
contribution of |z'_obs| > 2.58 when added alone to the selected directly explanatory 
variables, then such a variable needs also to be included as an important directly explanatory 
variable. This did not happen in the given data set. 
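
The threshold and a single backward selection step can be illustrated with the standard library alone. The sketch takes the z_obs values of the starting model for response Y as reported in the table above; the function names are illustrative, and the refitting that has to follow every deletion is not shown.

```python
from statistics import NormalDist


def two_sided_threshold(level=0.01):
    """Quantile q of the standard Gaussian with Pr(|Z| > q) = level."""
    return NormalDist().inv_cdf(1 - level / 2)


def z_obs(coeff, s_coeff):
    """Ratio of an estimated coefficient to its estimated standard deviation."""
    return coeff / s_coeff


def next_to_delete(z_values, threshold):
    """Variable with the smallest |z_obs| if it is below the threshold, else None."""
    name = min(z_values, key=lambda v: abs(z_values[v]))
    return name if abs(z_values[name]) < threshold else None


# z_obs of the starting model for response Y, copied from the table above
z_start = {
    "Z_a": -11.19, "X_a": -3.04, "Z_b": 0.26, "X_b": 1.82, "U": -0.51,
    "A": -2.48, "V": 1.76, "B": -0.64, "(Z_a - mean(Z_a))^2": 3.41,
}
first_candidate = next_to_delete(z_start, two_sided_threshold())
```

Here the first variable to be deleted is Z_b, pain intensity before. After each deletion the regression has to be refitted before the next candidate is chosen, which is why the z'_obs of the excluded variables differ from the z_obs of the starting model.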

The tables show for linear models also R^2, the coefficient of determination, both for 
the full and for the selected model. Multiplied by 100, it gives the percentage of the 
variation in the response explained by the model. 



Response: X_a, depression after treatment; linear regression 

                                  starting model            selected           excluded 
explanatory variables          coeff  s_coeff  z_obs    coeff  s_coeff  z_obs    z'_obs 
constant                        2.54     -       -       4.55     -       - 
Z_b, pain intensity before     -0.05   0.22   -0.23                              -0.21 
X_b, depression before          0.62   0.06   10.43      0.68   0.05   12.68 
U, pain chronicity              0.96   0.42    2.28                               2.31 
A, site of pain                -1.19   1.25   -0.95                              -0.10 
V, previous illnesses           0.05   0.15    0.35                               1.08 
B, level of schooling           0.15   1.09    0.14                              -0.01 

R^2_full = 0.46        Selected model X_a : X_b        R^2_sel = 0.45 



Response: Z_b, intensity of pain before; linear regression 

                                  starting model            selected           excluded 
explanatory variables          coeff  s_coeff  z_obs    coeff  s_coeff  z_obs    z'_obs 
constant                        7.60     -       -       7.38     -       - 
U, pain chronicity              0.10   0.13    0.77                               0.59 
A, site of pain                -0.58   0.40   -1.44                              -1.20 
V, previous illnesses           0.02   0.05    0.46                               0.72 
B, level of schooling          -0.94   0.35   -2.70     -0.89   0.33   -2.65 

R^2_full = 0.05        Selected model Z_b : B        R^2_sel = 0.03 







In the linear regression of $Z_a$ on $X_a$ and on the directly explanatory variables of 
both $Z_a$ and $X_a$, that is on $Z_b, X_b, A$, the contribution of $X_a$ leads to z_obs = 3.51, which 
coincides - by definition - with z_obs computed for the contribution of $Z_a$ in the linear 
regression of $X_a$ on $Z_a$ and on $Z_b, X_b, A$. Hence the two responses are correlated even 
after considering the directly explanatory variables and a dashed line joining $Z_a$ and $X_a$ 
is added to the well- fitting regression graph in Figure [HI 

In the linear regression of Zb on Xb and on the directly explanatory variables of both 

Zb and Xb, that is on U, A, V, B, the contribution of Xb leads to Zobs = 2.64. Hence the 

two responses are associated after considering their directly explanatory variables and 

there is a dashed line joining Zb and Xb in the regression graph of Figure [HI 

The relatively strict criterion for excluding variables assures that all edges in the 
derived regression graph correspond to dependences that are considered 
to be substantive in the given context. Had instead a 0.975 quantile been chosen as 
threshold, then one arrow from A to Y and another from U to Xa would have been 
added to the regression graph. Though this would correspond to a better goodness-of- 
fit, such weak dependences are less likely to become confirmed as being important in 
follow-up studies. 



Response: X_b, depression before; linear regression 

                                  starting model            selected           excluded 
explanatory variables          coeff  s_coeff  z_obs    coeff  s_coeff  z_obs    z'_obs 
constant                       10.96     -       -       7.31     -       - 
U, pain chronicity              1.97   0.49    4.02      1.78   0.46    3.87 
A, site of pain                -2.33   1.50   -1.55                              -1.42 
V, previous illnesses           0.54   0.18    2.99      0.55   0.18    3.06 
B, level of schooling          -1.10   1.31   -0.84                              -0.57 

R^2_full = 0.18        Selected model X_b : U + V        R^2_sel = 0.17 



Response: U, chronicity of pain; linear regression 

                                  starting model            selected           excluded 
explanatory variables          coeff  s_coeff  z_obs    coeff  s_coeff  z_obs    z'_obs 
constant                        2.93     -       -       2.47     -       - 
A, site of pain                 0.95   0.21    4.58      1.02   0.20    5.02 
V, previous illnesses           0.14   0.02    5.83      0.14   0.02    5.92 
B, level of schooling          -0.27   0.19   -1.43                              -1.43 

R^2_full = 0.26        Selected model U : A + V        R^2_sel = 0.25 



Response: A, site of pain; logistic regression 

                              starting model                selected           excluded
explanatory variables       coeff  s_coeff    z_obs    coeff  s_coeff    z_obs    z_obs
constant                     0.26        -        -     0.60        -        -
V, previous illnesses        0.05     0.04     1.22        -        -        -     1.22
B, level of schooling       -1.25     0.40    -3.11    -1.28     0.40    -3.18

Selected model A : B; response recoded to (0,1) instead of (1,2) 

Response: V, previous illnesses; linear regression 

                              starting model                selected           excluded
explanatory variables       coeff  s_coeff    z_obs    coeff  s_coeff    z_obs    z_obs
constant                     6.41        -        -     5.53        -        -
B, level of schooling       -0.65     0.54    -1.20        -        -        -    -1.20

Selected model V : - 
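The way a quantile threshold separates kept from excluded regressors can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_by_z` is hypothetical, the comparison is done in a single pass on the starting model rather than by refitting after each exclusion, and the quantile is a free parameter (0.975 is used here only because the text mentions it as the more lenient alternative). The coefficients and standard errors are those of the starting model for response Xb.

```python
from statistics import NormalDist


def select_by_z(coeffs, quantile):
    """Split regressors by comparing |coeff / s_coeff| with a standard normal quantile.

    coeffs maps a variable name to (coefficient, standard error).
    Returns two dicts, kept and excluded, holding the rounded ratios.
    """
    threshold = NormalDist().inv_cdf(quantile)
    kept, excluded = {}, {}
    for name, (coeff, se) in coeffs.items():
        z = coeff / se
        target = kept if abs(z) > threshold else excluded
        target[name] = round(z, 2)
    return kept, excluded


# Starting model for response Xb: (coeff, s_coeff) per explanatory variable.
starting_xb = {
    "U": (1.97, 0.49),
    "A": (-2.33, 1.50),
    "V": (0.54, 0.18),
    "B": (-1.10, 1.31),
}

# Assumed threshold: the 0.975 quantile mentioned in the text as the lenient choice.
kept, excluded = select_by_z(starting_xb, quantile=0.975)
print("kept:", kept)
print("excluded:", excluded)
```

Even at this lenient threshold, A and B stay excluded for response Xb, which agrees with the selected model Xb : U + V; a stricter quantile only widens the margin.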



The subgraph induced by Za, Zb, Xa, Xb of the regression graph in Figure 8 corresponds 
to two seemingly unrelated regressions. Even though separate least-squares estimates can 
in principle be severely distorted, for the present data the structure fits so well in 
the unconstrained multivariate regression of Za and Xa on Zb, Xb, U, V, A, B, that is, 
in a simple covering model, that none of these potential problems is relevant. 

With C = {U, V, A, B}, this is evident from the observed covariance matrix of Za, Xa 
given Zb, Xb, C, denoted here by $\tilde{\Sigma}_{aa|bC}$, and the observed regression 
coefficient matrix $\tilde{\Pi}_{a|b.C}$ being almost identical to the corresponding 
m.l.e. $\hat{\Sigma}_{aa|bC}$ and $\hat{\Pi}_{a|b.C}$. 

The former can be obtained by sweeping or partially inverting the observed covariance 
matrix of the eight variables with respect to Zb, Xb, C and the latter by using an 
adaptation of the EM-algorithm, due to Kiiveri (1989), on the observed covariance matrix 
of the four symptoms, corrected for linear regression on C. In this way, one gets 

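$$
\tilde{\Sigma}_{aa|bC} = \begin{pmatrix} 5.61 & 3.91 \\ 3.91 & ? \end{pmatrix}, \qquad
\hat{\Sigma}_{aa|bC} = \begin{pmatrix} 5.66 & 3.94 \\ 3.94 & ? \end{pmatrix},
$$

$$
\tilde{\Pi}_{a|b.C} = \begin{pmatrix} 0.12 & 0.03 \\ ? & ? \end{pmatrix}, \qquad
\hat{\Pi}_{a|b.C} = \begin{pmatrix} 0.14 & 0.00 \\ 0.00 & 0.60 \end{pmatrix}.
$$

(Entries shown as ? are not legible in the source; the lower-left covariance entries follow by symmetry.) 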
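The conditioning step described above, partially inverting a covariance matrix with respect to the conditioning variables, can be sketched with NumPy. This is a minimal illustration, not the authors' implementation: the function name `condition_on` and the small four-variable covariance matrix are assumptions made for the example, and the data of the study are not reproduced. The function returns the conditional covariance matrix $\Sigma_{aa|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$ and the regression coefficient matrix $\Pi_{a|b} = \Sigma_{ab}\Sigma_{bb}^{-1}$.

```python
import numpy as np


def condition_on(sigma, b):
    """Return (Sigma_aa|b, Pi_a|b) for the variables not in b, given those in b.

    sigma is a symmetric positive definite covariance matrix and
    b is a list of row/column indices of the conditioning variables.
    """
    sigma = np.asarray(sigma, dtype=float)
    b = list(b)
    a = [i for i in range(sigma.shape[0]) if i not in b]
    s_aa = sigma[np.ix_(a, a)]
    s_ab = sigma[np.ix_(a, b)]
    s_bb = sigma[np.ix_(b, b)]
    # Pi_a|b = Sigma_ab Sigma_bb^{-1}, computed by solving instead of inverting.
    pi_ab = np.linalg.solve(s_bb, s_ab.T).T
    cond_cov = s_aa - pi_ab @ s_ab.T
    return cond_cov, pi_ab


# Hypothetical example: two responses (indices 0, 1), two regressors (indices 2, 3).
pi_true = np.array([[0.5, 0.0], [0.0, 0.6]])
sigma_bb = np.array([[1.0, 0.3], [0.3, 1.0]])
resid_cov = np.array([[2.0, 1.0], [1.0, 3.0]])
sigma = np.block([
    [pi_true @ sigma_bb @ pi_true.T + resid_cov, pi_true @ sigma_bb],
    [sigma_bb @ pi_true.T, sigma_bb],
])

cond_cov, pi = condition_on(sigma, b=[2, 3])
print(cond_cov)
print(pi)
```

Because the example covariance matrix is built from a known coefficient matrix and residual covariance, the two returned matrices recover exactly those inputs, which makes the sketch easy to check by hand.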

The assumed definition of the joint distribution in terms of univariate and multivariate 
regressions ensures that the overall fit of the model can be judged locally in two 
steps. First, one compares each unconstrained, full regression of a single response with 
regressions constrained by some independences, that is, by selecting a subset of directly 
explanatory variables from the list of the potentially explanatory variables. Next, one 
decides for each component pair of a joint response whether this pair is conditionally 
independent given their directly explanatory variables considered jointly. This can again 
be achieved by single univariate regressions, as illustrated above for the joint responses 
Za and Xa. 
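The second step, checking a component pair of a joint response by a single univariate regression, might look as follows for linear dependences. This is a sketch under assumptions, not the analysis of the paper: the data are simulated, the names `ols_with_ratios`, `y1`, `y2` and `w` are hypothetical, and the check regresses one response component on the other together with the union of their regressors, then inspects the ratio of the coefficient to its standard error.

```python
import numpy as np


def ols_with_ratios(y, X):
    """Least-squares fit with intercept; return coefficients and coeff / s_coeff ratios."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    dof = len(y) - X1.shape[1]
    cov_beta = (resid @ resid / dof) * np.linalg.inv(X1.T @ X1)
    return beta, beta / np.sqrt(np.diag(cov_beta))


# Simulated joint response (y1, y2) with regressors w and deliberately dependent residuals.
rng = np.random.default_rng(0)
n = 500
w = rng.normal(size=(n, 2))
shared = rng.normal(size=n)  # common residual part that induces the dependence
y1 = 1.0 + 0.5 * w[:, 0] + shared + rng.normal(size=n)
y2 = 2.0 + 0.6 * w[:, 1] + shared + rng.normal(size=n)

# Regress y1 on y2 and on the union of regressors; index 1 belongs to y2.
beta, ratios = ols_with_ratios(y1, np.column_stack([y2, w]))
print("coefficient of y2:", beta[1], "ratio:", ratios[1])
```

A large absolute ratio for the other response component, as in this simulated case, speaks against conditional independence of the pair; a small one would support removing the edge between the two components.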